We spent almost two weeks diagnosing impalad and HBase RegionServer crashes, and finally traced them to a Spark job that had been changed and was sometimes failing with excessive GC.
The failing executors/containers were using 2,000% CPU instead of 100% or less (the Spark job was launched with 1 core per executor). CPU usage on the box was through the roof, even though Cloudera Manager didn't show it in the per-role graph (which has lines for NodeManager cpu_system_rate and cpu_user_rate, but apparently doesn't include container processes):
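One possible explanation for how a "1-core" executor burns 20+ cores (an assumption on our part, not confirmed from the logs): HotSpot sizes its default GC thread pool from the host's physical core count, not from the container's vcore allocation, so a GC-thrashing JVM on a 32-core box can legitimately spin up ~23 GC threads. A minimal sketch of HotSpot's default ParallelGCThreads formula:

```python
def default_parallel_gc_threads(host_cpus: int) -> int:
    """HotSpot's default for -XX:ParallelGCThreads:
    all CPUs up to 8, then 5/8 of each CPU beyond that."""
    if host_cpus <= 8:
        return host_cpus
    return 8 + (host_cpus - 8) * 5 // 8

# On a 32-core NodeManager host, a container JVM that never set
# ParallelGCThreads explicitly gets 23 GC worker threads,
# regardless of the --cores 1 it was launched with.
print(default_parallel_gc_threads(32))  # -> 23
```

If that's what happened here, pinning -XX:ParallelGCThreads (and -XX:ConcGCThreads for CMS/G1) to something near the executor's core allocation would at least bound the damage during a GC storm.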
Before the container died we were able to capture the following, which I take to mean YARN killed it because of the excessive GC (or did the JVM kill itself? It was started with -XX:OnOutOfMemoryError=kill):
Here's top showing more than 2,000% CPU on a java process:
We can see this was the container for that executor, and the launch command explicitly shows the --cores 1 argument:
We're on CDH 5.10 and use both static and dynamic resource pools. The Cgroup I/O Weight for the YARN NodeManager is 35%, and Container Virtual CPU Cores is set to 12 (out of 32 cores total). All dynamic pools use the DRF scheduler, and the pool this job runs in has 130 cores out of 336 total.
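For what it's worth, my understanding is that YARN's cgroup CPU enforcement is a soft limit by default: cpu.shares only guarantees a minimum, and a container is free to consume idle cores. Upstream Hadoop 2.6+ (which CDH 5.10 is based on) has a strict mode that hard-caps each container at its vcore allocation via CFS quotas. A sketch of the yarn-site.xml settings, assuming the upstream property names (I haven't verified how these map onto CM's safety valves):

```xml
<!-- Enable cgroup-based resource isolation for containers -->
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<!-- Hard-cap each container at its vcore allocation (CFS quota)
     instead of the default soft cpu.shares weighting -->
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

With strict mode off (the default), the 2,000% we saw wouldn't actually violate anything YARN enforces; it just blows past what the scheduler accounted for.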
So is this total breakdown of the resource allocation model a YARN bug, or a known limitation that one simply has to be aware of and watch for?