Related topic: Jobs fail in Yarn with out of Java heap memory error
... where your colleague bcwalrus said, "That [yarn.nodemanager.vmem-check-enabled] shouldn't matter though. You said that the job died due to OOME. It didn't die because it got killed by NM." Is it what happened here, too?
And what's the reason to set mapreduce.*.java.opts.max.heap in addition to mapreduce.*.memory.mb? Wouldn't it just introduce more potential conflict w/o much benefit?
We do not expose the vmem setting in Cloudera Manager since it is really troublesome to get that check correct. Depending on how the memory gets allocated the virtual memory overhead could be anywhere between 5% of the JVM size to multiple times the full JVM size.
Your container is getting killed due to the physical memory (not virtual memory) over use. the best thing is to make sure that you allow for an overhead on top of the JVM size. We normally recommend 20% of the JVM heap size as the overhead. Again this is workload dependent and could differ for you.
We are working on the change that you only need to set one of the two and fully support that in Cloudera Manager. Some of the changes have been made to the underlying MR code already via MAPREDUCE-5785...