
NodeManager receives SIGKILL on CDH 5.4.1

therealmfigura — Fri, 16 Sep 2022 10:30:04 GMT

Hi,

I am running a custom Hadoop/YARN application on a 20-node CDH 5.4.1 cluster. Every node runs a NodeManager. Once in a while, some of the NodeManagers spontaneously restart. This shows up as an unexpected exit alert in Cloudera Manager.

Nothing appears in the NodeManager logs (/var/log/hadoop-yarn/) before the startup message.

/var/log/cloudera-scm-agent/cloudera-scm-agent.log notes the unexpected exit but gives no other information.

/var/log/cloudera-scm-agent/supervisord.log notes that NodeManager exited due to SIGKILL.

Is there another Cloudera (or Hadoop) component that might be sending the SIGKILL besides the Cloudera agent?

Usually a group of about 5 NodeManagers restart at once. Then, no restarts for hours or days. It's not always the same nodes.

Thanks for any help!

Mark

Re: NodeManager receives SIGKILL on CDH 5.4.1

therealmfigura — Mon, 25 Jul 2016 16:20:02 GMT

Update: I've found that NodeManager is being killed by Cloudera's killparent.sh script after an OutOfMemoryError. I found this by modifying killparent.sh to log a message before killing NodeManager.

We've increased the -Xmx setting for NodeManager from 1GB to 2GB and it's still happening, though less often. It's unclear why this is happening since the JVM memory usage reported through Cloudera Manager doesn't seem to be especially close to the maximum.

I suppose the next step is to enable heapdump on OOM, though this may be difficult on this production cluster...
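For anyone following along: on a HotSpot JVM, heap dumps on OOM are normally enabled with the options -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<dir>, added to the NodeManager's Java options. The snippet below is only a rough sketch for confirming that the options took effect. It inspects the JVM it runs in, not a remote NodeManager, and the class name HeapDumpFlagCheck is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports heap-dump-related HotSpot options of the current JVM.
class HeapDumpFlagCheck {
    /** Returns the current value of a HotSpot VM option in this JVM as a string. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // "true" when the JVM was started with -XX:+HeapDumpOnOutOfMemoryError
        System.out.println("HeapDumpOnOutOfMemoryError = "
                + vmOption("HeapDumpOnOutOfMemoryError"));
        // Empty when no -XX:HeapDumpPath was given
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

Running it with and without the flag on the command line shows the value flip between true and false.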

Re: NodeManager receives SIGKILL on CDH 5.4.1

therealmfigura — Thu, 04 Aug 2016 15:16:31 GMT

Update: I was finally able to reproduce the problem on a non-production cluster where I could enable heapdump on OOM. I found that NodeManager was holding some very large Strings containing the stdout/stderr of the applications it was running. The fix is to redirect stdout/stderr to /dev/null in our ContainerLaunchContext so the streams are not picked up by NodeManager at all.
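In case it helps someone else, here is a minimal sketch of how the redirect might look when building the container command. It uses only plain Java so it runs on its own. The class name, the helper name, and the example command (com.example.MyTask) are assumptions for illustration, not our actual code. In a real application master the resulting list would be passed to ContainerLaunchContext.setCommands(List<String>).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a container launch command whose output is discarded.
class QuietContainerCommand {
    /** Joins the command parts with spaces and appends stdout/stderr redirects. */
    static String withDiscardedOutput(List<String> parts) {
        return String.join(" ", parts) + " 1>/dev/null 2>/dev/null";
    }

    public static void main(String[] args) {
        // Example command only; substitute the application's real launch command.
        List<String> parts =
                Arrays.asList("$JAVA_HOME/bin/java", "-Xmx512m", "com.example.MyTask");
        List<String> commands = Collections.singletonList(withDiscardedOutput(parts));

        // In an application master this would be: ctx.setCommands(commands);
        System.out.println(commands.get(0));
    }
}
```

If the output is still needed for debugging, redirecting to files in the container's log directory instead of /dev/null would keep it out of NodeManager's memory in the same way.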