Node managers in stopped state

New Contributor

All NodeManagers go into the stopped state within a couple of seconds of starting up. After a manual start, the NodeManager status shows as active, but it still ends up in the stopped state. All jobs remain in the ACCEPTED state.

I find the following error in the NodeManager logs:

2018-07-23 17:23:28,988 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0069_01_000001
java.io.IOException: Timeout while waiting for exit code from container_e101_1532344242009_0069_01_000001
	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:205)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-07-23 17:23:28,989 WARN  launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(106)) - Recovered container exited with a non-zero exit code 154
2018-07-23 17:23:28,991 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e101_1532344242009_0069_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-07-23 17:23:28,991 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e101_1532344242009_0069_01_000001
2018-07-23 17:23:29,006 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0071_01_000001
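
To see how many containers are failing recovery, the log can be scanned for the "Unable to recover container" message shown above. The following is only a minimal sketch: the function name `failed_recoveries` is made up for illustration, and it assumes the log lines look like the excerpt in this post.

```python
import re

# Matches the "Unable to recover container <id>" message seen in the excerpt above.
RECOVERY_FAILURE = re.compile(r"Unable to recover container (container_\w+)")


def failed_recoveries(log_lines):
    """Return the container IDs that failed recovery, in first-seen order, no duplicates."""
    seen = []
    for line in log_lines:
        match = RECOVERY_FAILURE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Example usage (the log path is whatever your NodeManager writes to):
#   with open(path_to_nodemanager_log, errors="replace") as handle:
#       for container_id in failed_recoveries(handle):
#           print(container_id)
```

A long list of distinct container IDs here suggests the NodeManager is trying to recover many leftover containers at startup rather than hitting a single bad one.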
Accepted Solutions

Re: Node managers in stopped state

New Contributor

I have resolved the issue.

All the resources were 100% utilised because of a security breach. A cron job was using the YARN service to consume the cluster's resources.

Resolution: I closed all public ports and IPs and deleted the cron jobs from /var/spool/cron/crontabs.

Fortunately, it was just a test cluster, and the network admin had opened the ports only for a while. So don't keep any ports public in your cluster.
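
Before deleting anything, it can help to list what is actually scheduled. The following is a rough sketch for reviewing the crontab directory mentioned above; the function name `list_cron_entries` is hypothetical, and reading this directory normally requires root.

```python
from pathlib import Path


def list_cron_entries(crontab_dir="/var/spool/cron/crontabs"):
    """Map each crontab file name (usually a user name) to its non-comment entries."""
    entries = {}
    for path in sorted(Path(crontab_dir).iterdir()):
        if not path.is_file():
            continue
        lines = path.read_text(errors="replace").splitlines()
        # Skip blank lines and comment lines; keep everything else as an entry.
        active = [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]
        if active:
            entries[path.name] = active
    return entries


# Example usage:
#   for user, jobs in list_cron_entries().items():
#       for job in jobs:
#           print(f"{user}: {job}")
```

Any entry you do not recognise, especially under the yarn user, is worth checking on every node in the cluster.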

Re: Node managers in stopped state

New Contributor

https://community.hortonworks.com/questions/66523/yarn-node-manager-not-starting.html

If yarn.nodemanager.recovery.enabled is set to true, set it to false (if turning off recovery is acceptable for you).
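
To check the current value before changing it, something like the sketch below can read the property from yarn-site.xml. The helper name `get_property` is made up for illustration, and the location of yarn-site.xml depends on your distribution, so pass in the path used on your nodes.

```python
import xml.etree.ElementTree as ET


def get_property(yarn_site_path, name):
    """Return the value of a property in a Hadoop-style XML config, or None if absent."""
    root = ET.parse(yarn_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Example usage (path_to_yarn_site is an assumption, adjust for your cluster):
#   print(get_property(path_to_yarn_site, "yarn.nodemanager.recovery.enabled"))
```

A result of None means the property is not set in that file, so the default applies. On a managed cluster, change the value through the management UI rather than editing the file by hand, since manual edits are usually overwritten.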