Support Questions

Find answers, ask questions, and share your expertise

YARN NodeManager unexpected exits occurring after different periods

Expert Contributor

Hello,

 

OK, so I had the NodeManager running completely fine, and surprisingly it started to crash and exit every few minutes. For instance, it exits at some point, and after 10-15 minutes it is back again.

 

I looked at the host logs, and the NodeManager logs specifically, and found the following messages related to a "stop instruction by container for application xxxx":

 

2018-08-06 23:10:09,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3048 for container-id container_1533576341741_0986_01_000001: -1B of 1 GB physical memory used; -1B of 2.1 GB virtual memory used
2018-08-06 23:10:10,178 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_1533576341741_0986_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1533576341741_0986_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2018-08-06 23:10:10,186 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1533576341741_0986_01_000001
2018-08-06 23:10:10,259 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1533576341741_0986/container_1533576341741_0986_01_000001
2018-08-06 23:10:10,270 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1533576341741_0986 CONTAINERID=container_1533576341741_0986_01_000001
2018-08-06 23:10:10,278 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
2018-08-06 23:10:10,279 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1533576341741_0986_01_000001 from application application_1533576341741_0986
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1533576341741_0986_01_000001 for log-aggregation
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1533576341741_0986
2018-08-06 23:10:11,287 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1533576341741_0986_01_000001]
2018-08-06 23:10:12,843 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1533576341741_0986_01_000001

 

 

Has anyone faced a similar issue, or can anyone help me solve it?

 

Thanks

1 ACCEPTED SOLUTION

Expert Contributor
The dr.who issue is very common these days. I am not sure who is exploiting the open-source project or something, but the main cause is usually a remote shell script that gets attached to your ResourceManager node, which causes the dr.who jobs to spawn. You don't need to Kerberize; just use some Linux firewall. Thanks
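
As an illustration (not part of the original answer; the YARN CLI commands below are standard, and the application ID is only a placeholder copied from the logs above), one rough way to confirm and clean up is to look for the rogue applications submitted as dr.who and kill them:

# List running YARN applications; the exploit usually shows up as jobs submitted by dr.who
yarn application -list

# Kill a rogue application by its ID (placeholder ID below)
yarn application -kill application_1533576341741_0986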


9 REPLIES


You are probably hitting OOM, maybe an overloaded system. Do you have any warnings about overcommitment (how much memory does the node have for the OS, YARN, Impala, etc.)?
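
For what it is worth, a minimal sketch of how one might check for OOM-killer activity and memory headroom on the NodeManager host (generic Linux commands; the yarn-site.xml path is a typical client-config location and may differ on your install):

# Has the kernel OOM killer terminated any processes recently?
dmesg | grep -iE 'killed process|out of memory'

# Current memory headroom on the node
free -m

# Compare with what YARN is allowed to hand out to containers
grep -A1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml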

 

Expert Contributor
I solved this; it was not a memory-related issue anyway. Thank you for commenting. But I wonder if you can help me with Oozie?

I am trying to create a workflow for Flume with Oozie; any useful examples would help.

Explorer

Hi hadoopNoob, 

I have the same problem with dr.who and unexpected exits.

Could you give me an idea (links or simple guidance) of how you resolved it?

 

Thanks

 

6:00:25.009 AM INFO NodeStatusUpdaterImpl
Registered with ResourceManager as ip-172-31-42-197.us-west-2.compute.internal:8041 with total resource of <memory:1024, vCores:2>
6:00:25.009 AM INFO NodeStatusUpdaterImpl
Notifying ContainerManager to unblock new container-requests
6:00:25.318 AM ERROR RecoveredContainerLaunch
Unable to recover container container_1535261132868_0172_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1535261132868_0172_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
6:00:25.325 AM WARN RecoveredContainerLaunch
Recovered container exited with a non-zero exit code 154
6:00:25.329 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:25.329 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0172_01_000001
6:00:25.416 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0172/container_1535261132868_0172_01_000001
6:00:25.418 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0172 CONTAINERID=container_1535261132868_0172_01_000001
6:00:25.423 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:25.423 AM INFO Application
Removing container_1535261132868_0172_01_000001 from application application_1535261132868_0172
6:00:25.424 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0172_01_000001 for log-aggregation
6:00:25.424 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0172
6:00:25.452 AM INFO ContainersMonitorImpl
Starting resource-monitoring for container_1535261132868_0172_01_000001
6:00:25.454 AM INFO ContainersMonitorImpl
Stopping resource-monitoring for container_1535261132868_0172_01_000001
6:00:26.429 AM INFO NodeStatusUpdaterImpl
Removed completed containers from NM context: [container_1535261132868_0172_01_000001]
6:00:26.491 AM INFO ContainerManagerImpl
Start request for container_1535261132868_0174_01_000001 by user dr.who
6:00:26.492 AM INFO ContainerManagerImpl
Creating a new application reference for app application_1535261132868_0174
6:00:26.500 AM INFO Application
Application application_1535261132868_0174 transitioned from NEW to INITING
6:00:26.505 AM INFO AppLogAggregatorImpl
rollingMonitorInterval is set as -1. The log rolling monitoring interval is disabled. The logs will be aggregated after this application is finished.
6:00:26.519 AM INFO NMAuditLogger
USER=dr.who IP=172.31.35.169 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.545 AM INFO Application
Adding container_1535261132868_0174_01_000001 to application application_1535261132868_0174
6:00:26.545 AM INFO Application
Application application_1535261132868_0174 transitioned from INITING to RUNNING
6:00:26.546 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from NEW to LOCALIZED
6:00:26.546 AM INFO AuxServices
Got event CONTAINER_INIT for appId application_1535261132868_0174
6:00:26.655 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from LOCALIZED to RUNNING
6:00:26.678 AM INFO DefaultContainerExecutor
launchContainer: [bash, /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001/default_container_executor.sh]
6:00:26.788 AM WARN DefaultContainerExecutor
Exit code from container container_1535261132868_0174_01_000001 is : 143
6:00:26.789 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:26.789 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0174_01_000001
6:00:26.851 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001
6:00:26.865 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.865 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:26.866 AM INFO Application
Removing container_1535261132868_0174_01_000001 from application application_1535261132868_0174
6:00:26.866 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0174_01_000001 for log-aggregation
6:00:26.866 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0174

Expert Contributor
Alright, a few questions; please reply ASAP.

Is your cluster exposed to the internet? What kind of server are you using?

Do you have any firewall?

Can you run this command on the ResourceManager host and tell me what you see: "sudo -u yarn crontab -l"
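
A broader check, just as an illustration (generic Linux commands, nothing specific to CDH), would be to look at every local user's crontab and at any suspicious download processes on the node:

# Look for injected cron entries under every local user, not just yarn
for u in $(cut -d: -f1 /etc/passwd); do
    echo "== crontab for $u =="
    sudo -u "$u" crontab -l 2>/dev/null
done

# Look for suspicious download or shell processes
ps aux | grep -iE 'curl|wget|\.sh' | grep -v grep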

Explorer

Thanks for the response; I don't know what to do.

1. Completely new cluster setup: 4 nodes on Amazon AWS (idle)

Cloudera 5.15.1

RedHat 7.5

Cloudera SCM with all main roles on t2.x2large  -- 16 GB RAM

Data nodes: t2.medium, 4 GB RAM

Cluster new, without load

2. Cluster exposed to the internet on ports 22, 50070, 19888, 8042, 7180, 8020, 7432, 7182-7183, 8088, plus ICMP

I use inbound/outbound rules for the security group on AWS (not enough)

3. 
[ec2-user@ip-172-31-35-169 ~]$ sudo -u yarn crontab -l
no crontab for yarn

 

Thanks

 

 

 

Expert Contributor
Are you sure that the ResourceManager is on this host, ip-172-31-35-169?

Expert Contributor
Restrict your cluster to only whitelisted IPs and use some firewall; it will be solved.
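
For illustration only, a minimal firewalld sketch of what such a whitelist could look like on RHEL 7 (the admin IP 203.0.113.10 and the port list are assumptions; the equivalent restriction can also be done in the AWS security group):

# Create a zone that only trusts the whitelisted admin IP
sudo firewall-cmd --permanent --new-zone=hadoop-admin
sudo firewall-cmd --permanent --zone=hadoop-admin --add-source=203.0.113.10/32
# Open the Hadoop web/RPC ports only within that zone
sudo firewall-cmd --permanent --zone=hadoop-admin --add-port=8088/tcp
sudo firewall-cmd --permanent --zone=hadoop-admin --add-port=8042/tcp
sudo firewall-cmd --permanent --zone=hadoop-admin --add-port=50070/tcp
sudo firewall-cmd --reload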

Explorer

I set only my IP.

Doubled the RAM.

I know that enabling Kerberos resolves that issue, but in previous versions (5.12, May-June) everything was working by default, without problems, on a smaller configuration.

 

 

huh

 

 

Thanks

 
