I'm trying to install HDP, but after the installation finishes the NodeManagers go down and I cannot start them again via Ambari.
Even when they do start, they go down again after a couple of minutes.
Please let me know what to do to make them work properly.
This is the only ERROR that appears in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-demohdpole-w-1-20180904110946.log:
2018-09-05 06:30:03,105 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(98)) - Unable to recover container container_e01_1536072828773_1909_01_000001
java.io.IOException: Timeout while waiting for exit code from container_e01_1536072828773_1909_01_000001
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:228)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:85)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:48)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
@Alaa Nabil, the stack trace posted above comes from the NodeManager trying to recover a container. Do you know what this application (application_1536072828773_1909) is, and what its status was?
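In case it helps to see how the application ID was read off the container ID in the log, here is a minimal Python sketch. The function name `application_id_from_container` is made up for illustration, and the pattern assumes the usual container ID layout (an optional `e<epoch>` part, then cluster timestamp, application sequence, attempt number and container number).

```python
import re

# Assumed layout: container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
_CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)$")


def application_id_from_container(container_id: str) -> str:
    """Return the application ID that a YARN container ID belongs to."""
    match = _CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError(f"not a recognised container ID: {container_id!r}")
    cluster_timestamp, app_sequence = match.group(1), match.group(2)
    return f"application_{cluster_timestamp}_{app_sequence}"


if __name__ == "__main__":
    # Container ID taken from the ERROR line in the question.
    print(application_id_from_container("container_e01_1536072828773_1909_01_000001"))
```

The resulting ID can then be searched for in the ResourceManager UI to see who submitted the application and what state it ended in.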
What do you see in the ResourceManager UI? Are there any jobs running?
It seems resources are not available on YARN. Please check the RM UI for available resources.
I have seen this kind of issue recently, caused by security breaches on non-secure Hadoop clusters and on clusters with port 8088 open to all IP addresses.
If you find an unknown job (submitted by user "dr.who") eating up your cluster resources, refer to the link below to troubleshoot the issue.
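If the UI is awkward to reach, the same check can be scripted against the ResourceManager REST API. The following is only a rough sketch: it assumes the RM web service is reachable over plain HTTP without authentication, and the helper names (`apps_url`, `summarize_apps`, `find_apps`) and the host in the usage note are placeholders, not anything from your cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_base: str, user: str, states: str = "ACCEPTED,RUNNING") -> str:
    """Build the Cluster Applications API URL filtered by user and state."""
    query = urlencode({"user": user, "states": states})
    return f"{rm_base.rstrip('/')}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload: dict) -> list:
    """Extract (id, user, state, allocatedMB, allocatedVCores) per application.

    The RM returns "apps": null when nothing matches, so missing or null
    values are treated as an empty list.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [
        (a["id"], a["user"], a["state"],
         a.get("allocatedMB", 0), a.get("allocatedVCores", 0))
        for a in apps
    ]


def find_apps(rm_base: str, user: str = "dr.who") -> list:
    """Query the ResourceManager and list active applications for one user."""
    with urlopen(apps_url(rm_base, user), timeout=10) as response:
        return summarize_apps(json.load(response))
```

For example, `find_apps("http://<rm-host>:8088")` would list active applications submitted as "dr.who". Anything unexpected in that list can be stopped with `yarn application -kill <application_id>`, but it will likely come back until port 8088 is closed to untrusted addresses.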