One dead big job blocks all jobs

hosako — Thu, 22 Oct 2015 14:26:41 GMT

On a relatively busy cluster, I ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no jobs, including this one, made any progress.

To recover from this state, I had to kill this job and the other jobs as well.

I can't reproduce this at will, but it happens occasionally.

Has anyone come across a similar symptom? Is there a smarter way to recover from this state? Killing jobs manually isn't ideal. Maybe I need to check or modify some YARN config?

Re: One dead big job blocks all jobs

nsabharwal — Thu, 22 Oct 2015 17:40:22 GMT

@hosako@hortonworks.com

Do you have the Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler view, configuring queues, and allocating resources appropriately.

Link
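
To make "allocate resources appropriately" a bit more concrete, here is a rough Python sketch that sanity-checks a set of Capacity Scheduler properties. The queue names (`batch`, `adhoc`) and the percentages are hypothetical, and the sketch assumes an unset `maximum-capacity` means the queue is uncapped. The point is only that a `maximum-capacity` below 100 keeps one huge job in a single queue from taking the whole cluster.

```python
# Sketch only: values mimic capacity-scheduler.xml but are made up.
PREFIX = "yarn.scheduler.capacity.root"

def check_queues(props):
    """Return a list of problems found in the queue capacity settings."""
    problems = []
    queues = [q.strip() for q in props[f"{PREFIX}.queues"].split(",")]
    total = 0.0
    for q in queues:
        cap = float(props[f"{PREFIX}.{q}.capacity"])
        # Assumption: treat a missing maximum-capacity as 100 (uncapped).
        max_cap = float(props.get(f"{PREFIX}.{q}.maximum-capacity", 100))
        total += cap
        if max_cap < cap:
            problems.append(f"{q}: maximum-capacity is below capacity")
        if max_cap >= 100:
            problems.append(f"{q}: uncapped, one job can fill the cluster")
    if abs(total - 100.0) > 1e-6:
        problems.append(f"root: queue capacities sum to {total}, not 100")
    return problems

# Hypothetical two-queue layout.
example = {
    f"{PREFIX}.queues": "batch,adhoc",
    f"{PREFIX}.batch.capacity": "70",
    f"{PREFIX}.batch.maximum-capacity": "80",
    f"{PREFIX}.adhoc.capacity": "30",
}
print(check_queues(example))
```

With this layout only the `adhoc` queue is flagged, because it has no cap of its own.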

Re: One dead big job blocks all jobs

jstraub — Thu, 22 Oct 2015 20:18:10 GMT

Good point! But I think if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.

Re: One dead big job blocks all jobs

TerryP — Thu, 22 Oct 2015 20:40:50 GMT

Yes, do check ALL the HDP configurations and make sure the physical memory is not overcommitted. The possibility of a rogue process consuming memory on the node is still there, but that may not be in your control.
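
A minimal sketch of the overcommit check, in Python. It assumes the memory handed to YARN is the value of `yarn.nodemanager.resource.memory-mb`, and that you have your own estimate of what the OS and other daemons on the node need. The figures in the example are hypothetical.

```python
def overcommit_mb(physical_mb, nm_memory_mb, reserved_mb):
    """Return how many MB the node is overcommitted by (0 if it fits).

    physical_mb  -- RAM installed on the node
    nm_memory_mb -- yarn.nodemanager.resource.memory-mb
    reserved_mb  -- your estimate for the OS and non-YARN daemons
    """
    return max(0, nm_memory_mb + reserved_mb - physical_mb)

# Hypothetical 64 GB node: 60 GB given to YARN, 8 GB needed elsewhere.
print(overcommit_mb(65536, 61440, 8192))
# Same node with 48 GB given to YARN instead.
print(overcommit_mb(65536, 49152, 8192))
```

A non-zero result means containers plus everything else can ask for more RAM than the node has, which is the situation to avoid.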

Re: One dead big job blocks all jobs

pcodding — Thu, 22 Oct 2015 20:56:26 GMT

I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.

Re: One dead big job blocks all jobs

sluangsay — Thu, 22 Oct 2015 22:22:40 GMT

Doesn't YARN offer a protection mechanism against too much overcommitting?

I am thinking of the parameters:

yarn.nodemanager.pmem-check-enabled

yarn.nodemanager.vmem-check-enabled

yarn.nodemanager.vmem-pmem-ratio
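
As I understand these three parameters, the NodeManager compares each container's memory usage against its allocation and kills the container when a limit is exceeded. Here is a simplified Python sketch of that decision. It is not the actual NodeManager code, the function and class names are made up, and the defaults shown (both checks on, ratio 2.1) are what I believe ships in Hadoop 2.x, so verify them against your version.

```python
from dataclasses import dataclass

@dataclass
class NodeManagerMemoryConfig:
    pmem_check_enabled: bool = True   # yarn.nodemanager.pmem-check-enabled
    vmem_check_enabled: bool = True   # yarn.nodemanager.vmem-check-enabled
    vmem_pmem_ratio: float = 2.1      # yarn.nodemanager.vmem-pmem-ratio

def over_limit_reason(conf, container_mb, used_pmem_mb, used_vmem_mb):
    """Return why a container would be killed, or None if it is within limits."""
    # The virtual memory limit is derived from the container's allocation.
    vmem_limit_mb = container_mb * conf.vmem_pmem_ratio
    if conf.vmem_check_enabled and used_vmem_mb > vmem_limit_mb:
        return "virtual memory over limit"
    if conf.pmem_check_enabled and used_pmem_mb > container_mb:
        return "physical memory over limit"
    return None

conf = NodeManagerMemoryConfig()
# A 2048 MB container may use up to 2048 * 2.1 MB of virtual memory.
print(over_limit_reason(conf, 2048, 1500, 5000))
print(over_limit_reason(conf, 2048, 2500, 3000))
```

Note that these checks are per container. They limit what a single task can use, but they would not catch a process outside YARN eating memory on the node.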

Re: One dead big job blocks all jobs

hosako — Fri, 23 Oct 2015 13:56:26 GMT

Thanks everyone!

What Terry described looks very close to the symptom.

SmartSense is already installed and the Capacity Scheduler is already configured; I will review the config.

I will also check the YARN NodeManager parameters.