I am using CDH 5.3.2. I am unable to set a value for yarn.nodemanager.pmem-check-enabled through the UI. I added the following to the ResourceManager Advanced Configuration Snippet (Safety Valve) for yarn-site.xml and restarted the service:
<property> <name>yarn.nodemanager.pmem-check-enabled</name> <value>false</value> </property>
However, I still see my apps getting killed for exceeding the physical memory limit:
2015-07-27 18:53:46,528 [AMRM Callback Handler Thread] INFO HoyaAppMaster.yarn (HoyaAppMaster.java:onContainersCompleted(847)) - Container Completion for containerID=container_1437726395811_0116_01_000002, state=COMPLETE, exitStatus=-104, diagnostics=Container [pid=36891,containerID=container_1437726395811_0116_01_000002] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.
Set it through the NodeManager yarn-site.xml configuration snippet.
You used the ResourceManager snippet, but the check is not performed by that service. That is why it did not work for you.
This does not seem to have worked with a later version of CDH (5.13.1). There we had to set it through:
YARN Client Advanced Configuration Snippet (Safety Valve) for yarn-site.xml
So, what is the correct way to set this? Has this really changed in newer releases?
Related topic: Jobs fail in Yarn with out of Java heap memory error
... where your colleague bcwalrus said, "That [yarn.nodemanager.vmem-check-enabled] shouldn't matter though. You said that the job died due to OOME. It didn't die because it got killed by NM." Is that what happened here, too?
And what's the reason to set mapreduce.*.java.opts.max.heap in addition to mapreduce.*.memory.mb? Wouldn't it just introduce more potential conflict w/o much benefit?
We do not expose the vmem setting in Cloudera Manager since it is really troublesome to get that check correct. Depending on how the memory gets allocated, the virtual memory overhead could be anywhere from 5% of the JVM size to multiple times the full JVM size.
Your container is getting killed due to physical memory (not virtual memory) overuse. The best thing is to make sure that the container size allows for overhead on top of the JVM heap. We normally recommend 20% of the JVM heap size as the overhead. Again, this is workload dependent and could differ for you.
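As a rough illustration of the 20% rule of thumb above, here is a minimal Python sketch of the arithmetic. The helper names are hypothetical, and the map-side property names in the printed output are only examples of where the two numbers might go. The 20% default is just the recommendation quoted above, not a fixed requirement.

```python
import math

def container_memory_mb(heap_mb, overhead_fraction=0.20):
    """Container size needed for a given JVM heap, rounded up to a whole MB.

    overhead_fraction is the share of the heap reserved for non-heap
    memory; 0.20 follows the rule of thumb above and is workload dependent.
    """
    return math.ceil(heap_mb * (1 + overhead_fraction))

def max_heap_mb(container_mb, overhead_fraction=0.20):
    """Largest JVM heap that still leaves the overhead inside the container."""
    return math.floor(container_mb / (1 + overhead_fraction))

if __name__ == "__main__":
    # Example: a 1 GB container, like the one killed in the log above.
    container_mb = 1024
    heap_mb = max_heap_mb(container_mb)
    # Illustrative settings only; adjust for your own workload.
    print(f"mapreduce.map.memory.mb = {container_mb}")
    print(f"mapreduce.map.java.opts = -Xmx{heap_mb}m")
```

With a 1024 MB container this leaves a heap of about 853 MB. If the container is still killed at that size, the overhead for that workload is probably larger than 20%, and either the container should grow or the heap should shrink.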
We are working on a change so that you only need to set one of the two, and on fully supporting that in Cloudera Manager. Some of the changes have been made to the underlying MR code already via MAPREDUCE-5785...