
One dead big job blocks all jobs


On a relatively busy cluster, we ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no jobs, including this one, made any progress.

To recover from this state, we had to kill this job as well as the other jobs.

We can't reproduce this at will, but it happens occasionally.

Have you come across a similar symptom? Is there a smarter way to recover from this state? Killing jobs manually isn't ideal. Maybe we need to check or modify some YARN config?

1 ACCEPTED SOLUTION


Re: One dead big job blocks all jobs

Thanks everyone!

What Terry described looks very close to the symptom.

SmartSense is installed and the Capacity Scheduler is configured; we will review the configuration.

We will also check the YARN NodeManager parameters.

6 REPLIES

Re: One dead big job blocks all jobs

@hosako@hortonworks.com

Do you have the Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler view, configuring queues, and allocating resources appropriately.
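For illustration, a minimal capacity-scheduler.xml sketch with two queues; the queue names and percentages are placeholders, not the poster's actual configuration. The key idea is that maximum-capacity puts a hard cap on any one queue, so a single runaway job cannot claim the whole cluster:

```xml
<!-- capacity-scheduler.xml: illustrative two-queue layout (placeholder values) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- hard cap: even under elastic growth, "batch" can never take 100% -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>40</value>
</property>
```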


Re: One dead big job blocks all jobs

Good point! But I think if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.
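The memory settings to review are in yarn-site.xml; a sketch with placeholder values (tune them to the node's actual RAM):

```xml
<!-- yarn-site.xml: memory settings to validate (placeholder values) -->
<property>
  <!-- total RAM the NodeManager may hand out to containers on this node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>112640</value>
</property>
<property>
  <!-- smallest container allocation the scheduler will grant -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <!-- cap on a single container's request; oversized requests are rejected -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```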

Re: One dead big job blocks all jobs

Expert Contributor

Yes, do check ALL the HDP configurations and make sure the physical memory is not overcommitted. The possibility of a rogue process consuming memory on the node is still there, but that may not be in your control.
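As a back-of-the-envelope overcommit check: the NodeManager's allocatable memory plus the reservations for the OS and other daemons should fit inside the node's physical RAM. The numbers below are illustrative, not from the thread:

```python
# Overcommit sanity check with illustrative (not real) numbers.
node_physical_mb = 128 * 1024   # physical RAM on the node
os_and_daemons_mb = 16 * 1024   # reserve for OS, DataNode, NodeManager, etc.
nm_resource_mb = 120 * 1024     # yarn.nodemanager.resource.memory-mb

# If YARN can hand out more than (physical - reserved), containers can
# push the node into swapping or kernel OOM kills.
overcommitted = nm_resource_mb > node_physical_mb - os_and_daemons_mb
print(overcommitted)  # True: 122880 MB > 114688 MB available
```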

Re: One dead big job blocks all jobs

Expert Contributor

Doesn't YARN offer a protection mechanism against excessive overcommitting?

I am thinking of the parameters:

yarn.nodemanager.pmem-check-enabled

yarn.nodemanager.vmem-check-enabled

yarn.nodemanager.vmem-pmem-ratio
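The three parameters above live in yarn-site.xml; shown here with their usual Hadoop 2.x defaults. With the checks enabled, the NodeManager kills any container that exceeds its physical or virtual memory limit instead of letting it take down the node:

```xml
<!-- yarn-site.xml: NodeManager memory enforcement (Hadoop 2.x defaults) -->
<property>
  <!-- kill containers that exceed their physical memory allocation -->
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <!-- kill containers that exceed their virtual memory limit -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <!-- virtual memory allowed per MB of physical memory allocated -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```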

Re: One dead big job blocks all jobs

Guru

I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.

