Node managers in stopped state

All nodemanagers go into the stopped state within a couple of seconds of starting up. After a manual start, the nodemanager status shows as active, but the nodemanager still remains in the stopped state. All jobs remain in the ACCEPTED state.

I find the following error in the nodemanager logs:

2018-07-23 17:23:28,988 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0069_01_000001
java.io.IOException: Timeout while waiting for exit code from container_e101_1532344242009_0069_01_000001
	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:205)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-07-23 17:23:28,989 WARN  launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(106)) - Recovered container exited with a non-zero exit code 154
2018-07-23 17:23:28,991 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e101_1532344242009_0069_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-07-23 17:23:28,991 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e101_1532344242009_0069_01_000001
2018-07-23 17:23:29,006 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0071_01_000001

ACCEPTED SOLUTION

I have resolved the issue.

All the resources were 100% utilised because of a security breach. A cron job was using the YARN service to consume the cluster's resources.

Resolution: I closed all public ports and IPs and deleted the cron jobs from /var/spool/cron/crontabs.

Fortunately, it was just a test cluster, and the network admin had opened the ports for a while. So don't keep any ports public in your cluster.
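
To review what is scheduled before deleting anything, a small script can list the active entries in each user's crontab. This is only a sketch, not the exact procedure used above: the function name list_cron_entries is made up for illustration, the directory is the one mentioned in the resolution, and reading it normally requires root. It does not decide which jobs are malicious; it only prints them for manual inspection.

```python
from pathlib import Path


def list_cron_entries(crontab_dir="/var/spool/cron/crontabs"):
    """Return {user: [active crontab lines]} for each crontab file in crontab_dir.

    Blank lines and comment lines (starting with '#') are skipped.
    Hypothetical helper for illustration only.
    """
    entries = {}
    for path in sorted(Path(crontab_dir).iterdir()):
        if not path.is_file():
            continue
        active = []
        for line in path.read_text(errors="replace").splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                active.append(stripped)
        if active:
            # Crontab files in this directory are named after their owner.
            entries[path.name] = active
    return entries


# Example usage (run as root on the affected node):
#   for user, lines in list_cron_entries().items():
#       print(user, lines)
```

Any entry that nobody on the team recognises, especially one owned by the yarn user, would be worth checking before it is removed.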

2 REPLIES

https://community.hortonworks.com/questions/66523/yarn-node-manager-not-starting.html

If yarn.nodemanager.recovery.enabled is set to true, set it to false (provided that turning off nodemanager recovery is fine for you).
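
In yarn-site.xml the full property name is yarn.nodemanager.recovery.enabled. To check its current value before changing it, something like the following could be used. It is a rough sketch: the helper names get_yarn_property and recovery_enabled are made up for illustration, and it only reads the XML text it is given, so a property that is absent from the file is reported as not enabled regardless of any default applied elsewhere.

```python
import xml.etree.ElementTree as ET


def get_yarn_property(yarn_site_xml, name):
    """Return the value of property `name` from Hadoop-style config XML text, or None."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def recovery_enabled(yarn_site_xml):
    """True only if the file explicitly sets yarn.nodemanager.recovery.enabled to true."""
    value = get_yarn_property(yarn_site_xml, "yarn.nodemanager.recovery.enabled")
    return value is not None and value.lower() == "true"


# Sample content in the usual <configuration>/<property> layout.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

print(recovery_enabled(SAMPLE))
```

On a real node you would pass in the contents of your own yarn-site.xml (its location depends on the installation), and on a managed cluster the change itself is usually made through the management UI followed by a nodemanager restart.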