04-25-2016 07:57 PM
I brought the QA cluster down to 2 NodeManagers. Here are the log files for those 2 after the job stalled.
04-25-2016 09:23 PM
One thing I forgot to mention that you could do is check the memory and disk usage of the nodes running the NodeManager. Disk space, file descriptors, or memory could fill up and cause the NodeManager to shut down because of the file-deletion problem.
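A few quick checks on each NodeManager host would look something like this (generic Linux commands, nothing specific to your cluster):

```shell
# Disk usage per filesystem: watch for partitions near 100%
df -h

# Memory and swap, straight from the kernel
grep -E '^(MemTotal|MemFree|SwapTotal|SwapFree)' /proc/meminfo

# System-wide file descriptors: allocated, unused, max
cat /proc/sys/fs/file-nr

# Open-file limit for the current shell (the NodeManager runs under its own limit)
ulimit -n
```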
04-25-2016 10:27 PM
I checked the memory on both nodes, and they both spike to 97.78% with all 10 containers running for about 10 minutes. I couldn't look at the file descriptors, though. All the metrics spike together: GC, CPU usage, disk latency, network throughput, while the JVM Heap Memory Usage and Java Threads metrics disappear off the charts. The 4x2TB disks on each node are barely used because the amount of data is small, ~57GB. Do you have an idea of what to configure or change to fix this? And why does CDH 5.5.2 do this when earlier versions didn't? Also, 'ulimit -n' is 65536, if that helps.
04-26-2016 10:19 AM
The cause of the high memory/CPU usage in this case might not be exactly the same as what caused your original problem of jobs getting stuck. I see a lot of recovery activity in the NodeManager log. My guess is that because the containers were not cleaned up previously, they were recovered when you restarted the NodeManager, taking a lot of resources, but they still cannot be cleaned up, for the same reason that caused containers to get stuck in the previous run. I suspect it might have something to do with the cgroup setup, though I have no knowledge of how cgroups are used or set up in CDH. I have consistently seen cgroup errors in the NodeManager log, which might have led to the failure to clean up the containers' resources. As a result, the containers can never be reclaimed, and they stay in the state store to be recovered the next time the NodeManager comes back.
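One way to spot this pattern yourself is to grep the NodeManager log for recovery and cgroup messages. The sample lines below are illustrative (modeled on YARN's NMLeveldbStateStoreService and CgroupsLCEResourcesHandler classes), not taken from your actual logs, and the real log path on CDH will differ:

```shell
# Simulated log excerpt; on a real node, grep the NodeManager log file instead
cat > /tmp/nm-sample.log <<'EOF'
INFO recovery.NMLeveldbStateStoreService: Recovering container container_1461600000000_0001_01_000002
WARN util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /sys/fs/cgroup/cpu/hadoop-yarn/container_1461600000000_0001_01_000002
EOF

# Count containers recovered from the state store after a restart
grep -c 'Recovering container' /tmp/nm-sample.log

# Count cgroup-related warnings during container cleanup
grep -ci 'cgroup' /tmp/nm-sample.log
```

A large number of "Recovering container" lines right after a restart, paired with repeated cgroup deletion warnings, would support the theory that cleanup is failing and the same containers keep coming back.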
04-26-2016 11:06 AM
Are you saying that the completed tasks (containers) are not being cleaned up, so new ones cannot be allocated to other tasks? If so, is there something documented that addresses this? Has anyone else heard of this?
04-26-2016 11:18 AM
Yes, it is very likely. I haven't checked exactly how the NodeManager handles cleanup failures in the code, though. You could read up on using cgroups with YARN to verify some of the settings.
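As a first sanity check, you can verify that the cgroup controllers are mounted and that the hadoop-yarn hierarchy exists; the mount point and hierarchy name below are the common defaults, which may not match your CDH setup:

```shell
# Where are the cgroup controllers mounted?
grep cgroup /proc/mounts || echo "no cgroup mounts found"

# Does the hadoop-yarn hierarchy exist under the cpu controller? (path is the usual default)
ls -ld /sys/fs/cgroup/cpu/hadoop-yarn 2>/dev/null || echo "hadoop-yarn cpu hierarchy not found"

# Leftover per-container cgroup directories that were never cleaned up would show here
ls /sys/fs/cgroup/cpu/hadoop-yarn 2>/dev/null | grep container_ || true
```

If container_* directories accumulate under the hierarchy after jobs finish, that would match the cleanup-failure theory.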
04-26-2016 11:34 AM
These are the settings in YARN that we have set regarding containers. Do you see anything out of the ordinary?
04-26-2016 11:52 AM
I'm afraid I can't help you much with cgroups, as I don't know how cgroups work with YARN (I just started working on YARN not long ago). Reading the Apache doc, you could verify that the hadoop-yarn cgroup hierarchy exists, and try setting yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user to the desired user, as described in the "CGroups and Security" section of https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html#CGroups_a...
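For reference, the cgroup-related yarn-site.xml properties from that Apache doc look roughly like this; the values shown are the doc's illustrative defaults (e.g. the "yarn" user), not your cluster's actual settings:

```xml
<!-- yarn-site.xml: cgroup-related settings per the Apache NodeManagerCgroups doc -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- cgroup hierarchy the NodeManager creates per-container directories under -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <!-- in nonsecure mode, all containers run as this single local user -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>yarn</value>
</property>
```

Whether that local-user matches an actual account with rights over the cgroup hierarchy is worth double-checking, since a mismatch there could plausibly explain cleanup failures.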