Support Questions

One dead big job blocks all jobs


On a relatively busy cluster, I ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no jobs made any progress, including this one.

To recover from this state, I needed to kill this job as well as the other jobs.

I can't reproduce this at will, but it happens occasionally.

Have you come across a similar symptom? Is there a smarter way to recover from this state? Killing jobs manually isn't ideal. Maybe I need to check or modify some YARN config?

1 ACCEPTED SOLUTION


Thanks everyone!

What Terry described looks very close to the symptom.

SmartSense has been installed, and the Capacity Scheduler has been configured; I will review the config.

I will also check the YARN NodeManager parameters.


6 REPLIES

Master Mentor

@hosako@hortonworks.com

Do you have the Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler view, configuring queues, and allocating resources appropriately.
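As a rough sketch of what that queue setup looks like (queue names and percentages here are made up for illustration, not a recommendation), a minimal two-queue layout in capacity-scheduler.xml prevents one large job from claiming the whole cluster:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment. Two queues so a single
     big job cannot take 100% of the cluster; names/percentages illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Cap how far the batch queue can grow beyond its guaranteed share -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>60</value>
</property>
```

Submitting the large job to the capped batch queue leaves headroom for everything else even when it misbehaves.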



Good point! But if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.
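The memory settings to validate first are the ones below; the values shown are a sketch for a hypothetical 64 GB node, not recommended numbers:

```xml
<!-- Hypothetical yarn-site.xml values for a 64 GB node; tune for your hardware. -->
<property>
  <!-- Total RAM this NodeManager may hand out to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <!-- Largest single container the scheduler will grant -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <!-- Smallest allocation unit; requests are rounded up to a multiple of this -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
```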

Super Collaborator

Yes, do check ALL the HDP configurations and make sure physical memory is not overcommitted. A rogue process consuming memory on the node is still possible, but that may not be in your control.
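The overcommit check is simple arithmetic: the memory YARN offers must leave room for the OS and the Hadoop daemons on the node. A sketch with made-up numbers (64 GB node, 8 GB reserved, 4 GB containers):

```shell
# Hypothetical node: 64 GB RAM; reserve ~8 GB for OS, DataNode, and NodeManager
# daemons. All numbers here are illustrative, not recommendations.
NODE_RAM_MB=65536
RESERVED_MB=8192
CONTAINER_MB=4096   # typical container size for this workload

# yarn.nodemanager.resource.memory-mb should not exceed RAM minus the reserve
YARN_NM_MEMORY_MB=$(( NODE_RAM_MB - RESERVED_MB ))
MAX_CONTAINERS=$(( YARN_NM_MEMORY_MB / CONTAINER_MB ))
echo "yarn.nodemanager.resource.memory-mb=${YARN_NM_MEMORY_MB}, max containers=${MAX_CONTAINERS}"
```

If the configured yarn.nodemanager.resource.memory-mb is larger than that difference, the node is overcommitted and one busy job can push it into OOM territory.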

Super Collaborator

Doesn't YARN offer a protection mechanism against overcommitting memory?

I am thinking of the parameters:

yarn.nodemanager.pmem-check-enabled

yarn.nodemanager.vmem-check-enabled

yarn.nodemanager.vmem-pmem-ratio
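In yarn-site.xml these look as follows (the values shown are the stock defaults); with the checks enabled, the NodeManager kills any container that exceeds its allocation instead of letting it drag down the whole node:

```xml
<!-- Stock defaults shown. The pmem/vmem checks kill containers that exceed
     their allocation, protecting the node from a single runaway job. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Virtual memory allowed per MB of physical memory requested -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```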


I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.
