Created on 08-06-2018 01:12 PM - edited 09-16-2022 06:33 AM
Hello,
OK, so I had the NodeManager running completely fine, and surprisingly it started to crash and exit every few minutes. For instance, it exits at some point and then comes back again after 10-15 minutes.
I looked at the host logs, and the NodeManager logs specifically, and found the following messages related to a "stop instruction by container for application xxxx":
2018-08-06 23:10:09,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3048 for container-id container_1533576341741_0986_01_000001: -1B of 1 GB physical memory used; -1B of 2.1 GB virtual memory used
2018-08-06 23:10:10,178 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_1533576341741_0986_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1533576341741_0986_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2018-08-06 23:10:10,186 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2018-08-06 23:10:10,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1533576341741_0986_01_000001
2018-08-06 23:10:10,259 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1533576341741_0986/container_1533576341741_0986_01_000001
2018-08-06 23:10:10,270 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1533576341741_0986 CONTAINERID=container_1533576341741_0986_01_000001
2018-08-06 23:10:10,278 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1533576341741_0986_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
2018-08-06 23:10:10,279 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1533576341741_0986_01_000001 from application application_1533576341741_0986
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1533576341741_0986_01_000001 for log-aggregation
2018-08-06 23:10:10,280 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1533576341741_0986
2018-08-06 23:10:11,287 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1533576341741_0986_01_000001]
2018-08-06 23:10:12,843 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1533576341741_0986_01_000001
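As far as I can tell, the "Timeout while waiting for exit code" comes from container recovery, where the NodeManager polls for an exit-code file next to the container's pid file under its local dirs, and exit code 154 is the LOST code it falls back to when that file never turns up; the "-1B of 1 GB physical memory used" line also suggests the process tree (3048) is already gone. A few checks that could be run on the NodeManager host (a sketch; the /yarn/nm path and PID 3048 come from the log above, the rest assumes default CDH locations):
CONTAINER=container_1533576341741_0986_01_000001
# Is the recovered container's process tree still alive?
ps -fp 3048 || echo "process tree 3048 is gone"
# Look for the pid / exit-code files the NodeManager waits on during recovery;
# they normally sit under the NM local dirs in an nmPrivate subdirectory.
sudo find /yarn/nm -name "${CONTAINER}*" 2>/dev/null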
Has anyone faced a similar issue, or can anyone help me solve it?
Thanks
Created 08-23-2018 06:43 AM
You are probably hitting OOM, maybe an overloaded system. Do you have any warnings about overcommitment (how much memory does the node have for the OS, YARN, Impala, etc.)?
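A few quick checks along those lines (a sketch; these are standard Linux commands, and /var/log/messages is the RHEL/CentOS syslog location):
free -h                                                      # how much memory the node actually has left
dmesg -T | grep -iE "out of memory|oom-killer" | tail -20    # did the kernel OOM killer fire?
sudo grep -i "killed process" /var/log/messages | tail -20   # same question, from syslog
ps -eo pid,user,rss,comm --sort=-rss | head -15              # biggest memory consumers right now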
Created on 08-25-2018 11:01 PM - edited 08-25-2018 11:12 PM
Hi hadoopNoob,
I have the same problem with dr.who and unexpected exits.
Could you give me an idea (links or simple guidance) of how you resolved it?
Thanks
6:00:25.009 AM INFO NodeStatusUpdaterImpl
Registered with ResourceManager as ip-172-31-42-197.us-west-2.compute.internal:8041 with total resource of <memory:1024, vCores:2>
6:00:25.009 AM INFO NodeStatusUpdaterImpl
Notifying ContainerManager to unblock new container-requests
6:00:25.318 AM ERROR RecoveredContainerLaunch
Unable to recover container container_1535261132868_0172_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1535261132868_0172_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
6:00:25.325 AM WARN RecoveredContainerLaunch
Recovered container exited with a non-zero exit code 154
6:00:25.329 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:25.329 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0172_01_000001
6:00:25.416 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0172/container_1535261132868_0172_01_000001
6:00:25.418 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0172 CONTAINERID=container_1535261132868_0172_01_000001
6:00:25.423 AM INFO Container
Container container_1535261132868_0172_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:25.423 AM INFO Application
Removing container_1535261132868_0172_01_000001 from application application_1535261132868_0172
6:00:25.424 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0172_01_000001 for log-aggregation
6:00:25.424 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0172
6:00:25.452 AM INFO ContainersMonitorImpl
Starting resource-monitoring for container_1535261132868_0172_01_000001
6:00:25.454 AM INFO ContainersMonitorImpl
Stopping resource-monitoring for container_1535261132868_0172_01_000001
6:00:26.429 AM INFO NodeStatusUpdaterImpl
Removed completed containers from NM context: [container_1535261132868_0172_01_000001]
6:00:26.491 AM INFO ContainerManagerImpl
Start request for container_1535261132868_0174_01_000001 by user dr.who
6:00:26.492 AM INFO ContainerManagerImpl
Creating a new application reference for app application_1535261132868_0174
6:00:26.500 AM INFO Application
Application application_1535261132868_0174 transitioned from NEW to INITING
6:00:26.505 AM INFO AppLogAggregatorImpl
rollingMonitorInterval is set as -1. The log rolling monitoring interval is disabled. The logs will be aggregated after this application is finished.
6:00:26.519 AM INFO NMAuditLogger
USER=dr.who IP=172.31.35.169 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.545 AM INFO Application
Adding container_1535261132868_0174_01_000001 to application application_1535261132868_0174
6:00:26.545 AM INFO Application
Application application_1535261132868_0174 transitioned from INITING to RUNNING
6:00:26.546 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from NEW to LOCALIZED
6:00:26.546 AM INFO AuxServices
Got event CONTAINER_INIT for appId application_1535261132868_0174
6:00:26.655 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from LOCALIZED to RUNNING
6:00:26.678 AM INFO DefaultContainerExecutor
launchContainer: [bash, /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001/default_container_executor.sh]
6:00:26.788 AM WARN DefaultContainerExecutor
Exit code from container container_1535261132868_0174_01_000001 is : 143
6:00:26.789 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
6:00:26.789 AM INFO ContainerLaunch
Cleaning up container container_1535261132868_0174_01_000001
6:00:26.851 AM INFO DefaultContainerExecutor
Deleting absolute path : /yarn/nm/usercache/dr.who/appcache/application_1535261132868_0174/container_1535261132868_0174_01_000001
6:00:26.865 AM WARN NMAuditLogger
USER=dr.who OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1535261132868_0174 CONTAINERID=container_1535261132868_0174_01_000001
6:00:26.865 AM INFO Container
Container container_1535261132868_0174_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
6:00:26.866 AM INFO Application
Removing container_1535261132868_0174_01_000001 from application application_1535261132868_0174
6:00:26.866 AM INFO AppLogAggregatorImpl
Considering container container_1535261132868_0174_01_000001 for log-aggregation
6:00:26.866 AM INFO AuxServices
Got event CONTAINER_STOP for appId application_1535261132868_0174
Created 08-26-2018 08:58 AM
Thanks for the response; I don't know what to do.
1. Completely new cluster setup, 4 nodes on AWS (idle)
Cloudera 5.15.1
RedHat 7.5
Cloudera SCM with all main roles on t2.xlarge -- 16 GB RAM
Data nodes t2.medium 4 GB RAM
Cluster new, without load
2. Cluster exposed to the internet on ports (22, 50070, 19888, 8042, 7180, 8020, 7432, 7182-7183, 8088) plus ICMP
I use inbound/outbound rules for the security group on AWS (apparently not enough)
3.
[ec2-user@ip-172-31-35-169 ~]$ sudo -u yarn crontab -l
no crontab for yarn
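A couple of related checks for rogue activity under the yarn user, in case it helps (a sketch; it assumes the yarn CLI is available on the host and that anything submitted anonymously shows up as dr.who):
ps -U yarn -f                              # anything unexpected running as the yarn user?
sudo ls -la /tmp /var/tmp                  # dropped binaries and scripts often land here
yarn application -list -appStates RUNNING  # any applications submitted by dr.who still running?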
Thanks
Created 08-26-2018 10:00 AM
I restricted access to my IP only,
and doubled the RAM.
I know that enabling Kerberos resolves the issue, but in previous versions (5.12, May-June) everything was working by default, without problems, on a smaller configuration.
huh
Thanks
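One way to verify the tightened security group actually cut off anonymous access is to probe the ResourceManager REST API from a machine outside the allowed IP (a sketch; <rm-public-address> is a placeholder for the RM instance's public DNS name, and 8088 is the default RM web port):
# Run from a host that is NOT covered by the security group rules.
curl -m 10 -X POST "http://<rm-public-address>:8088/ws/v1/cluster/apps/new-application"
# A timeout or connection refusal means anonymous (dr.who-style) submissions from the
# internet can no longer reach the cluster.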