
NodeManager receives SIGKILL on CDH 5.4.1


Hi,

 

I am running a custom Hadoop/YARN application on a 20-node CDH 5.4.1 cluster. Every node runs a NodeManager. Once in a while, some of the NodeManagers spontaneously restart, which shows up as an unexpected exit alert in Cloudera Manager.

 

What I've found so far:

- Nothing appears in the NodeManager logs (/var/log/hadoop-yarn/) before the startup message.
- /var/log/cloudera-scm-agent/cloudera-scm-agent.log notes the unexpected exit, but gives no other information.
- /var/log/cloudera-scm-agent/supervisord.log notes that the NodeManager exited due to SIGKILL.

 

Is there another Cloudera (or Hadoop) component that might be sending the SIGKILL besides the Cloudera agent?

 

Usually a group of about five NodeManagers restarts at once; then there are no restarts for hours or days. It's not always the same nodes.

 

Thanks for any help!

Mark


2 REPLIES


Update: I've found that the NodeManager is being killed by Cloudera's killparent.sh script after the JVM hits an OutOfMemoryError. I found this by modifying killparent.sh to log a message before it kills the NodeManager.

 

We've increased the -Xmx setting for the NodeManager from 1 GB to 2 GB and the restarts still happen, though less often. It's unclear why, since the JVM memory usage reported in Cloudera Manager doesn't seem to be especially close to the maximum.

 

I suppose the next step is to enable a heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError), though this may be difficult on this production cluster...

ACCEPTED SOLUTION

Update: I was finally able to reproduce this on a non-production cluster where I could enable a heap dump on OOM. The heap dump showed that the NodeManager was holding some very large Strings containing the stdout/stderr of the applications it was running. The fix is to redirect stdout/stderr to /dev/null in our ContainerLaunchContext so the streams are not picked up by the NodeManager at all.
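
In case it helps anyone else, here is a minimal sketch of that redirect, assuming an application master that builds its own ContainerLaunchContext. The class name QuietLaunch and the appCommand parameter are illustrative, not from the original application; only the ContainerLaunchContext and Records calls are standard YARN API.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class QuietLaunch {

    /**
     * Builds a launch context whose command discards the container's
     * stdout/stderr instead of writing them to the container log directory.
     */
    public static ContainerLaunchContext quietContext(String appCommand) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        // Redirect both streams to /dev/null rather than the usual
        // <LOG_DIR>/stdout and <LOG_DIR>/stderr targets.
        ctx.setCommands(Collections.singletonList(
                appCommand + " 1>/dev/null 2>/dev/null"));
        return ctx;
    }
}
```

The trade-off is that the container's stdout/stderr output is lost, so the application needs its own logging (for example, log4j writing into the container log directory) if that output matters.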