Node managers in stopped state

andy — Mon, 23 Jul 2018 19:18:31 GMT

All NodeManagers go into a stopped state within a couple of seconds of starting up. After I start them manually, the NodeManager status shows as active, but they still end up in the stopped state. All jobs remain in the ACCEPTED state.

I found the following error in the NodeManager logs:

2018-07-23 17:23:28,988 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0069_01_000001
java.io.IOException: Timeout while waiting for exit code from container_e101_1532344242009_0069_01_000001
	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:205)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-07-23 17:23:28,989 WARN  launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(106)) - Recovered container exited with a non-zero exit code 154
2018-07-23 17:23:28,991 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e101_1532344242009_0069_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-07-23 17:23:28,991 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e101_1532344242009_0069_01_000001
2018-07-23 17:23:29,006 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0071_01_000001
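
For anyone triaging a similar log, here is a minimal Python sketch (not part of the original post) that collects the IDs of the containers the NodeManager failed to recover. The function name `failed_recoveries` and the regular expression are illustrative assumptions, derived only from the log lines quoted above.

```python
import re

# Matches the "Unable to recover container <id>" message seen in the log above.
# The pattern is an assumption based on these sample lines only.
UNABLE_RE = re.compile(r"Unable to recover container (container_\S+)")


def failed_recoveries(log_text):
    """Return container IDs that failed recovery, in order, without duplicates."""
    seen = []
    for line in log_text.splitlines():
        match = UNABLE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Passing the NodeManager log contents to `failed_recoveries` gives a quick list of which containers (and so which applications) are stuck in recovery.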

Re: Node managers in stopped state

andy — Thu, 26 Jul 2018 13:37:26 GMT

I have resolved the issue.

All the resources were 100% utilised because of a security breach. A cron job was using the YARN service to consume resources.

Resolution: I closed all public ports and IPs and deleted the cron jobs from /var/spool/cron/crontabs.

Fortunately it was just a test cluster and the network admin had opened the ports for a while. So don't keep any ports public in your cluster.

Re: Node managers in stopped state

nitin_s_a_svp — Fri, 30 Nov 2018 19:33:37 GMT

https://community.hortonworks.com/questions/66523/yarn-node-manager-not-starting.html

If yarn.nodemanager.recovery.enabled is set to true in yarn-site.xml, set it to false (provided turning off recovery is fine for you).
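
To check the current value before changing it, something like the following Python sketch may help. It assumes the flag's full key in yarn-site.xml is `yarn.nodemanager.recovery.enabled`; the function name `recovery_enabled` is hypothetical, and the location of yarn-site.xml depends on your installation.

```python
import xml.etree.ElementTree as ET

# Full property key as it appears in yarn-site.xml.
PROPERTY = "yarn.nodemanager.recovery.enabled"


def recovery_enabled(yarn_site_xml):
    """Return the property's value as text, or None if it is not set in the file."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == PROPERTY:
            return prop.findtext("value", default="").strip()
    return None
```

Read the contents of your yarn-site.xml and pass them in. A result of `"true"` means recovery is on. `None` means the property is not set in that file.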
