One thing I forgot to mention that you could do is check the memory and disk usage on the nodes running NodeManager. Disk space, file descriptors, or memory could fill up and cause NodeManager to shut down because of the file-deletion problem. Something like the checks sketched below would tell you.
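Roughly the kind of checks I mean (just a sketch, assuming you can run these on the NodeManager hosts; the pgrep lookup is only one way to find the NodeManager process):

```
# Disk space on the partitions NodeManager writes to
df -h

# Memory on the host
free -m

# File-descriptor limit, and how many descriptors the NodeManager process actually holds
ulimit -n
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager)
ls /proc/$NM_PID/fd | wc -l
```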
I checked the memory on both nodes, and both spike to 97.78% with all 10 containers running, for about 10 minutes. I couldn't look at the file descriptors, though. All the metrics spike in the same way: GC, CPU usage, disk latency, network throughput, while the JVM Heap Memory Usage and Java Threads metrics disappear off the charts. The 4x2TB disks on each node are barely used because the amount of data is small, ~57GB. Do you have an idea of what to configure or change to fix this? And why does CDH 5.5.2 do this when earlier versions didn't? Also, 'ulimit -n' is 65536, in case that helps.
The cause of the high memory/CPU usage in this case might not be exactly the same as what caused your original problem of jobs getting stuck. I see a lot of recovery activity in the NodeManager log. I guess that because the containers were not cleaned up previously, they were recovered when you restarted NodeManager, taking up a lot of resources. But they still cannot be cleaned up, for the same reason that caused containers to get stuck in the previous run. I suspect it has something to do with the cgroup setup, though I have no knowledge of how cgroups are used or set up in CDH. I have consistently seen cgroup errors in the NodeManager log, which may have led to the failure to clean up the containers' resources. As a result, the containers can never be reclaimed, and they stay in the state store to be recovered the next time NodeManager comes back.
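A quick way to check whether that is what is happening (again just a sketch; the log path below is a typical CDH location and may differ on your nodes):

```
# Containers recovered from the state store when NodeManager restarts
grep -i "recover" /var/log/hadoop-yarn/*NODEMANAGER*.log* | tail -50

# Repeated cgroup errors around container cleanup
grep -i "cgroup" /var/log/hadoop-yarn/*NODEMANAGER*.log* | tail -50
```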
Are you saying that the completed tasks (containers) are not being cleaned up, so new ones cannot be allocated to other tasks? If so, is there a known fix for this documented somewhere? Has anyone else run into this?
These are the settings in YARN that we have set regarding containers. Do you see anything out of the ordinary?
I'm afraid I cannot help you much with cgroups, as I don't know how cgroups work with YARN (I only started working on YARN recently). Going by the Apache doc, you could verify that the hadoop-yarn cgroup hierarchy exists and try setting yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user to the desired user, as described in the "CGroups and Security" section of https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html#CGroups_a... The checks below sketch what I mean.
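Not an authoritative recipe, just the kind of verification I have in mind (paths are assumptions: /sys/fs/cgroup/cpu is the usual cgroup mount point, and on CDH the live yarn-site.xml may be under the Cloudera Manager process directories rather than /etc/hadoop/conf):

```
# Is a cgroup hierarchy mounted, and does the hadoop-yarn hierarchy exist under the cpu controller?
mount | grep cgroup
ls -ld /sys/fs/cgroup/cpu/hadoop-yarn

# Which container-executor / cgroup settings are actually in effect?
grep -B1 -A2 "linux-container-executor" /etc/hadoop/conf/yarn-site.xml

# The property from the Apache doc (set it to the user that should own the containers):
#   yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user
```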
If it's of any help: on both clusters I had to turn off Log Aggregation, because otherwise none of the jobs would start; they would sit in the pending state forever. Could this be a symptom of the same thing?
You can upload a container log so we can verify from the container's perspective what was happening.
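In case it helps to find one: with log aggregation turned off, the container logs stay on the local disks of the node that ran the container, under whatever yarn.nodemanager.log-dirs points to (check your config, the path varies by setup); with aggregation on you could pull them with the yarn CLI instead:

```
# Find the local log directory configured for NodeManager
grep -A1 "yarn.nodemanager.log-dirs" /etc/hadoop/conf/yarn-site.xml

# Then look under it for the application/container in question
ls <log-dir>/application_<application id>/container_<container id>/

# If log aggregation were enabled, the same logs could be fetched with:
yarn logs -applicationId <application id>
```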