YARN NodeManager unexpected exits occurring after different periods

Expert Contributor

Hello,

 

OK, so I had the NodeManager running completely fine, and surprisingly it started to crash and exit every few minutes. For instance, it exits at some point, and after 10-15 minutes it is back up again.

 

I looked into the host logs, and the NodeManager logs specifically, and found the following messages related to a "stop instruction by container for application xxxx":

 

2018-08-06 23:10:09,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3048 for container-id container_1533576341741_0986_01_000001: -1B of 1 GB physical memory used; -1B of 2.1 GB virtual memory used
2018-08-06 23:10:10,178 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_1533576341741_0986_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1533576341741_0986_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2018-08-06 23:10:10,186 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1533576341741_0986_01_000001
2018-08-06 23:10:10,259 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1533576341741_0986/container_1533576341741_0986_01_000001
2018-08-06 23:10:10,270 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1533576341741_0986 CONTAINERID=container_1533576341741_0986_01_000001
2018-08-06 23:10:10,278 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
2018-08-06 23:10:10,279 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1533576341741_0986_01_000001 from application application_1533576341741_0986
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1533576341741_0986_01_000001 for log-aggregation
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1533576341741_0986
2018-08-06 23:10:11,287 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1533576341741_0986_01_000001]
2018-08-06 23:10:12,843 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1533576341741_0986_01_000001

 

 

Has anyone faced a similar issue, or can anyone help me solve it?

 

Thanks

9 REPLIES


You are probably hitting OOM, maybe an overloaded system. Do you have any warnings about overcommitment (how much memory does the node have for the OS, YARN, Impala, etc.)?
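To rule that out, a quick check on the NodeManager host could look like the sketch below (standard Linux tools; on a Cloudera Manager deployment the effective yarn-site.xml lives under the agent's process directory rather than /etc/hadoop/conf, so the last path is only an assumption):

# Any recent OOM-killer activity in the kernel log?
dmesg -T | grep -i -E "out of memory|killed process"

# Current memory pressure on the host
free -m

# What YARN believes it can allocate on this node (config path is an assumption, see above)
grep -A1 "yarn.nodemanager.resource.memory-mb" /etc/hadoop/conf/yarn-site.xml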

 

Expert Contributor
I solved this; it was not a memory-related issue anyway. Thank you for commenting. But I wonder if you can help me with Oozie?

I am trying to create a workflow for Flume with Oozie; any useful examples would help.

Explorer

Hi hadoopNoob, 

I have the same problem with dr.who and unexpected exits. Could you give me an idea (links or simple guidance) of how you resolved it?

 

Thanks

 

6:00:25.009 AM INFO NodeStatusUpdaterImpl
Registered with ResourceManager as ip-172-31-42-197.us-west-2.compute.internal:8041 with total resource of <memory:1024, vCores:2>
6:00:25.009 AM INFO NodeStatusUpdaterImpl
Notifying ContainerManager to unblock new container-requests
6:00:25.318 AM ERROR RecoveredContainerLaunch
Unable to recover container container_1535261132868_0172_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1535261132868_0172_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
6:00:25.325 AM WARN RecoveredContainerLaunch
Recovered container exited with a non-zero exit code 154
6:00:25.329 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:25.329 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0172_01_000001
6:00:25.416 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0172/container_1535261132868_0172_01_000001
6:00:25.418 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0172 CONTAINERID=container_1535261132868_0172_01_000001
6:00:25.423 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:25.423 AM INFO Application
Removing container_1535261132868_0172_01_000001 from application application_1535261132868_0172
6:00:25.424 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0172_01_000001 for log-aggregation
6:00:25.424 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0172
6:00:25.452 AM INFO ContainersMonitorImpl
Starting resource-monitoring for container_1535261132868_0172_01_000001
6:00:25.454 AM INFO ContainersMonitorImpl
Stopping resource-monitoring for container_1535261132868_0172_01_000001
6:00:26.429 AM INFO NodeStatusUpdaterImpl
Removed completed containers from NM context: [container_1535261132868_0172_01_000001]
6:00:26.491 AM INFO ContainerManagerImpl
Start request for container_1535261132868_0174_01_000001 by user dr.who
6:00:26.492 AM INFO ContainerManagerImpl
Creating a new application reference for app application_1535261132868_0174
6:00:26.500 AM INFO Application
Application application_1535261132868_0174 transitioned from NEW to INITING
6:00:26.505 AM INFO AppLogAggregatorImpl
rollingMonitorInterval is set as -1. The log rolling monitoring interval is disabled. The logs will be aggregated after this application is finished.
6:00:26.519 AM INFO NMAuditLogger
USER=dr.who IP=172.31.35.169 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.545 AM INFO Application
Adding container_1535261132868_0174_01_000001 to application application_1535261132868_0174
6:00:26.545 AM INFO Application
Application application_1535261132868_0174 transitioned from INITING to RUNNING
6:00:26.546 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from NEW to LOCALIZED
6:00:26.546 AM INFO AuxServices
Got event CONTAINER_INIT for appId application_1535261132868_0174
6:00:26.655 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from LOCALIZED to RUNNING
6:00:26.678 AM INFO DefaultContainerExecutor
launchContainer: [bash, /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001/default_container_executor.sh]
6:00:26.788 AM WARN DefaultContainerExecutor
Exit code from container container_1535261132868_0174_01_000001 is : 143
6:00:26.789 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:26.789 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0174_01_000001
6:00:26.851 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001
6:00:26.865 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.865 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:26.866 AM INFO Application
Removing container_1535261132868_0174_01_000001 from application application_1535261132868_0174
6:00:26.866 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0174_01_000001 for log-aggregation
6:00:26.866 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0174

Expert Contributor
Alright, a few questions; please reply ASAP.

Is your cluster exposed to the internet? What kind of server are you using?

Do you have a firewall?

Can you run the following command on the ResourceManager host and tell me what you see: "sudo -u yarn crontab -l"
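For context, that command checks whether anything has planted cron jobs under the yarn user, which is a common sign that an internet-exposed YARN REST API is being abused to run the dr.who containers. A rough, hedged checklist using standard Linux tools:

# Cron jobs planted under the yarn user (empty output is a good sign)
sudo -u yarn crontab -l

# Processes currently running as yarn; unfamiliar binaries are suspicious
ps -fu yarn

# Common drop locations for downloaded payloads
ls -la /tmp /var/tmp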

Explorer

Thanks for the response; I don't know what to do.

1. Completely new cluster setup: 4 nodes on Amazon AWS (idle)

Cloudera 5.15.1

RedHat 7.5

Cloudera SCM with all main roles on t2.x2large  -- 16 GB RAM

Data nodes t2.medium  4 GB RAM

Cluster new, without load

2. Cluster exposed to the internet on ports 22, 50070, 19888, 8042, 7180, 8020, 7432, 7182-7183, 8088, and ICMP

I use inbound/outbound rules in the AWS security group (apparently not enough; see the AWS CLI sketch after this list)

3. 
[ec2-user@ip-172-31-35-169 ~]$ sudo -u yarn crontab -l
no crontab for yarn
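If it helps, the security group can be locked down from the AWS CLI so these ports accept traffic only from a single trusted address instead of 0.0.0.0/0. This is only a sketch: the security group ID and admin IP are placeholders, and port 8088 (the ResourceManager web UI) stands in for each port in the list above:

# Drop the rule that opens the ResourceManager web UI to the whole internet (placeholder group ID)
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8088 --cidr 0.0.0.0/0

# Re-allow the same port only from one trusted address (placeholder IP)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8088 --cidr 198.51.100.7/32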

 

Thanks

 

 

 

Expert Contributor
Are you sure that the ResourceManager is on this host? ip-172-31-35-169

Expert Contributor
Restrict your cluster to only whitelisted IPs and use a firewall; that will solve it.
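Since the nodes run RedHat 7.5, a host-level option alongside the AWS security group is a firewalld rich rule. A minimal sketch, assuming the default public zone and a placeholder admin IP; repeat per exposed port and node:

# Allow the ResourceManager web UI (8088) only from one trusted address (placeholder IP)
sudo firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="198.51.100.7/32" port port="8088" protocol="tcp" accept'

# Remove any blanket "allow 8088" rule so only the rich rule above applies
sudo firewall-cmd --permanent --zone=public --remove-port=8088/tcp

# Apply the permanent changes to the running firewall
sudo firewall-cmd --reload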

Explorer

I restricted access to only my IP,

Doubled RAM 

 

I know that enabling Kerberos resolves that issue, but in previous versions (5.12, May-June) everything was working by default without problems on a smaller configuration.

 

 

huh

 

 

Thanks

 
