Created 07-23-2018 12:18 PM
All NodeManagers go into the stopped state within a couple of seconds of starting up. After a manual start, the NodeManager status shows as active, but the NodeManager still ends up in the stopped state. All jobs remain in the ACCEPTED state.
I find the following error in the NodeManager logs:
2018-07-23 17:23:28,988 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0069_01_000001
java.io.IOException: Timeout while waiting for exit code from container_e101_1532344242009_0069_01_000001
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:205)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-07-23 17:23:28,989 WARN launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(106)) - Recovered container exited with a non-zero exit code 154
2018-07-23 17:23:28,991 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e101_1532344242009_0069_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-07-23 17:23:28,991 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e101_1532344242009_0069_01_000001
2018-07-23 17:23:29,006 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e101_1532344242009_0071_01_000001
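To see how many containers are failing recovery, the log can be scanned for the "Unable to recover container" message. The following is only a rough sketch: the function name `unrecovered_containers` is hypothetical, and the pattern assumes the message format shown in the log above.

```python
import re

# Matches the container ID that follows the error message seen in the log above.
PATTERN = re.compile(r"Unable to recover container (container_\S+)")

def unrecovered_containers(log_text):
    """Return the distinct container IDs that failed recovery, in first-seen order."""
    seen = []
    for container_id in PATTERN.findall(log_text):
        if container_id not in seen:
            seen.append(container_id)
    return seen
```

Pass it the contents of the NodeManager log file (the log location depends on your installation) and check the length of the returned list.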
Created 07-26-2018 06:37 AM
I have resolved the issue.
All the resources were 100% utilised because of a security breach. A cron job was using the YARN service to consume resources.
Resolution: I closed all public ports and IPs and deleted the cron jobs from /var/spool/cron/crontabs.
Fortunately it was just a test cluster, and the network admin had opened the ports for a while. So don't keep any ports public in your cluster.
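Before deleting anything, it helps to review what is actually scheduled. Here is a minimal sketch that lists the active entries of every crontab in a spool directory. The function name `list_cron_entries` is hypothetical, and it assumes one crontab file per user in that directory, as under /var/spool/cron/crontabs; reading that directory normally requires root.

```python
from pathlib import Path

def list_cron_entries(spool_dir):
    """Return {file name: [non-comment, non-blank lines]} for each crontab in spool_dir."""
    entries = {}
    for crontab in sorted(Path(spool_dir).iterdir()):
        if not crontab.is_file():
            continue
        active = []
        for raw in crontab.read_text(errors="replace").splitlines():
            line = raw.strip()
            # Skip blank lines and comments; keep everything else for review.
            if line and not line.startswith("#"):
                active.append(line)
        entries[crontab.name] = active
    return entries
```

Calling `list_cron_entries("/var/spool/cron/crontabs")` on each node and looking for entries you did not create is one way to spot an unexpected job like the one described above.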
Created 11-30-2018 11:33 AM
https://community.hortonworks.com/questions/66523/yarn-node-manager-not-starting.html
If yarn.nodemanager.recovery.enabled is set to true, set it to false (provided turning off recovery is acceptable for you).
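To check the current value before changing it, something like the sketch below can read it from the contents of yarn-site.xml. It assumes the property's full name is yarn.nodemanager.recovery.enabled and the standard Hadoop configuration layout (property elements with name and value children); the helper name `get_hadoop_property` is hypothetical.

```python
import xml.etree.ElementTree as ET

def get_hadoop_property(xml_text, name):
    """Return the value of a Hadoop config property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None
```

Read your yarn-site.xml into a string (its path depends on your installation) and call `get_hadoop_property(text, "yarn.nodemanager.recovery.enabled")`. A result of None means the file does not set the property, so the default applies. On a managed cluster, make the actual change through the management tool rather than by editing the file directly.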