Support Questions

One dead big job blocks all jobs


On a relatively busy cluster, I ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no jobs made any progress, including this one.

To recover from this state, I needed to kill this job as well as the other jobs.

I can't reproduce this at will, but it happens occasionally.

Have you come across a similar symptom? Is there a smarter way to recover from this state? Killing jobs manually isn't ideal. Maybe I need to check or modify some YARN config?

1 ACCEPTED SOLUTION


Thanks everyone!

What Terry described looks very close to the symptom.

SmartSense has been installed, and the Capacity Scheduler has been configured; I will review the config.

I will also check the YARN NodeManager parameters.


6 REPLIES

Master Mentor

@hosako@hortonworks.com

Do you have the Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler view, configuring queues, and allocating resources appropriately.
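As a rough sketch of what that queue setup looks like (queue names and percentages here are made up for illustration, not a recommendation), a minimal two-queue layout in capacity-scheduler.xml prevents one large job from claiming the whole cluster:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment. Two queues so a single
     big job cannot take 100% of the cluster; names/percentages illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Cap how far the batch queue can grow beyond its guaranteed share -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>60</value>
</property>
```

Submitting the large job to the capped batch queue leaves headroom for everything else even when it misbehaves.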



Good point! But if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.
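The memory settings to validate first are the ones below; the values shown are a sketch for a hypothetical 64 GB node, not recommended numbers:

```xml
<!-- Hypothetical yarn-site.xml values for a 64 GB node; tune for your hardware. -->
<property>
  <!-- Total RAM this NodeManager may hand out to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <!-- Largest single container the scheduler will grant -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <!-- Smallest allocation unit; requests are rounded up to a multiple of this -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
```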

Super Collaborator

Yes, do check ALL the HDP configurations and make sure physical memory is not overcommitted. A rogue process consuming memory on the node is still possible, but that may not be in your control.
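The overcommit check is simple arithmetic: the memory YARN offers must leave room for the OS and the Hadoop daemons on the node. A sketch with made-up numbers (64 GB node, 8 GB reserved, 4 GB containers):

```shell
# Hypothetical node: 64 GB RAM; reserve ~8 GB for OS, DataNode, and NodeManager
# daemons. All numbers here are illustrative, not recommendations.
NODE_RAM_MB=65536
RESERVED_MB=8192
CONTAINER_MB=4096   # typical container size for this workload

# yarn.nodemanager.resource.memory-mb should not exceed RAM minus the reserve
YARN_NM_MEMORY_MB=$(( NODE_RAM_MB - RESERVED_MB ))
MAX_CONTAINERS=$(( YARN_NM_MEMORY_MB / CONTAINER_MB ))
echo "yarn.nodemanager.resource.memory-mb=${YARN_NM_MEMORY_MB}, max containers=${MAX_CONTAINERS}"
```

If the configured yarn.nodemanager.resource.memory-mb is larger than that difference, the node is overcommitted and one busy job can push it into OOM territory.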

Super Collaborator

Doesn't YARN offer a protection mechanism against overcommitting memory?

I am thinking of the parameters:

yarn.nodemanager.pmem-check-enabled

yarn.nodemanager.vmem-check-enabled

yarn.nodemanager.vmem-pmem-ratio
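In yarn-site.xml these look as follows (the values shown are the stock defaults); with the checks enabled, the NodeManager kills any container that exceeds its allocation instead of letting it drag down the whole node:

```xml
<!-- Stock defaults shown. The pmem/vmem checks kill containers that exceed
     their allocation, protecting the node from a single runaway job. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Virtual memory allowed per MB of physical memory requested -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```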


I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.
