Created on 07-14-2016 09:34 AM - edited 09-16-2022 03:30 AM
Hi,
I am running a custom Hadoop/YARN application on a 20 node CDH 5.4.1 cluster. Every node runs NodeManager. Once in a while, some of the NodeManagers spontaneously restart. This shows up as an unexpected exit alert in Cloudera Manager.
What I've found so far:
- Nothing appears in the NodeManager logs (/var/log/hadoop-yarn/) before the startup message.
- /var/log/cloudera-scm-agent/cloudera-scm-agent.log notes the unexpected exit, but gives no other information.
- /var/log/cloudera-scm-agent/supervisord.log notes that the NodeManager exited due to SIGKILL.
Is there another Cloudera (or Hadoop) component that might be sending the SIGKILL besides the Cloudera agent?
Usually a group of about 5 NodeManagers restart at once. Then, no restarts for hours or days. It's not always the same nodes.
Thanks for any help!
Mark
Created on 07-25-2016 09:00 AM - edited 07-25-2016 09:20 AM
Update: I've found that the NodeManager is being killed by Cloudera's killparent.sh script when its JVM hits an OutOfMemoryError. I found this by modifying killparent.sh to log a message before it kills the NodeManager.
We've increased the -Xmx setting for NodeManager from 1GB to 2GB and it's still happening, though less often. It's unclear why this is happening since the JVM memory usage reported through Cloudera Manager doesn't seem to be especially close to the maximum.
I suppose the next step is to enable a heap dump on OOM, though this may be difficult on this production cluster...
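If I do get the chance, the plan would be to add the standard JVM flags to the NodeManager's Java options in Cloudera Manager, something like the following (the dump path here is just an example):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/nodemanager_oom.hprof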
Created 08-04-2016 08:16 AM
Update: I was finally able to reproduce this on a non-production cluster where I could enable a heap dump on OOM. I found that the NodeManager was holding some very large Strings containing the stdout/stderr of the applications it was running. The fix is to redirect stdout/stderr to /dev/null in our ContainerLaunchContext so the streams are not picked up by the NodeManager at all.
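For anyone hitting the same issue, here is a rough sketch of the change in the code that builds the launch context. "./my_app" is just a placeholder for our real container command, and the rest of the context (local resources, environment, tokens) is set up the same way as before:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class LaunchContextSketch {

    // Build a launch context whose command discards its own stdout/stderr,
    // so that output never flows back to (and accumulates inside) the NodeManager.
    static ContainerLaunchContext buildContext() {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        // "./my_app" is a placeholder for the real container command.
        ctx.setCommands(Collections.singletonList(
                "./my_app 1>/dev/null 2>/dev/null"));
        return ctx;
    }
}

The trade-off is that we lose the containers' stdout/stderr entirely; redirecting them to files under the container log directory would be the alternative if that output is ever needed.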