Created 10-22-2015 07:26 AM
On a relatively busy cluster, I ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no jobs, including this one, made any progress.
To recover from this state, I had to kill this job and the other jobs as well.
I can't reproduce this at will, but it happens occasionally.
Has anyone come across a similar symptom? Is there a smarter way to recover from this state? Killing jobs manually isn't ideal. Do I perhaps need to check or modify some YARN configuration?
Created 10-23-2015 06:56 AM
Thanks everyone!
What Terry described looks very close to the symptom.
SmartSense has been installed and Capacity Scheduler has been configured; I will review the config.
I will also check the YARN NodeManager parameters.
Created 10-22-2015 10:40 AM
Do you have Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler view, configuring queues, and allocating resources appropriately.
Created 10-22-2015 01:18 PM
Good point! But I think if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.
Created 10-22-2015 01:40 PM
Yes, do check ALL the HDP configurations and make sure the physical memory is not overcommitted. The possibility of a rogue process consuming memory on the node is still there, but that may not be in your control.
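As a rough illustration of what "not overcommitted" means here, the sketch below compares the memory YARN is allowed to hand out on a node (`yarn.nodemanager.resource.memory-mb`) with what the node can actually spare. The function name, the reserved-memory figure, and the node sizes are hypothetical; the real reserve depends on which other services (OS, DataNode, and so on) run on the node.

```python
def overcommit_mb(physical_mb, nm_memory_mb, reserved_mb):
    """Return how many MB YARN may allocate beyond what the node can spare.

    physical_mb  -- total RAM on the node
    nm_memory_mb -- value of yarn.nodemanager.resource.memory-mb
    reserved_mb  -- assumed reserve for the OS and other daemons (hypothetical)
    """
    available_for_yarn = physical_mb - reserved_mb
    # A positive result means containers can be granted more memory
    # than the node has left after the reserve.
    return max(0, nm_memory_mb - available_for_yarn)


# Hypothetical node: 64 GB RAM, 8 GB reserved, 60 GB given to YARN.
print(overcommit_mb(65536, 61440, 8192))  # 4096
```

A result of 0 suggests the setting leaves the assumed reserve intact; anything above 0 is memory that containers could be granted but that the node may not have.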
Created 10-22-2015 03:22 PM
Doesn't Yarn offer a protection mechanism against too much overcommitting?
I am thinking of the parameters:
yarn.nodemanager.pmem-check-enabled
yarn.nodemanager.vmem-check-enabled
yarn.nodemanager.vmem-pmem-ratio
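For context, these checks work per container: the NodeManager monitors each container's memory and kills it if it exceeds its own allocation. The sketch below is a simplified model of that decision, not the NodeManager's actual code; the function name and sample numbers are hypothetical, while the defaults shown (both checks enabled, ratio 2.1) match the stock YARN defaults.

```python
def container_over_limit(allocated_mb, used_pmem_mb, used_vmem_mb,
                         pmem_check=True, vmem_check=True,
                         vmem_pmem_ratio=2.1):
    """Return which limit a container exceeds: 'pmem', 'vmem', or None.

    pmem_check      -- models yarn.nodemanager.pmem-check-enabled
    vmem_check      -- models yarn.nodemanager.vmem-check-enabled
    vmem_pmem_ratio -- models yarn.nodemanager.vmem-pmem-ratio
    """
    # Physical memory limit is the container's allocation itself.
    if pmem_check and used_pmem_mb > allocated_mb:
        return "pmem"
    # Virtual memory limit is the allocation scaled by the ratio.
    if vmem_check and used_vmem_mb > allocated_mb * vmem_pmem_ratio:
        return "vmem"
    return None


# Hypothetical 1 GB container using 900 MB physical, 2200 MB virtual.
print(container_over_limit(1024, 900, 2200))  # vmem
```

Because the limits are tied to each container's allocation, they guard against a single container outgrowing its grant. They would not by themselves catch a node where the total memory offered to YARN is set higher than the node can spare.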
Created 10-22-2015 01:56 PM
I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.