Created 04-10-2017 05:36 PM
Greetings,
We are running a 10-datanode HDP v2.5 cluster on Ubuntu 14.04, installed via Ambari. Whenever I run a large YARN job it fails with the following error:
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I'm not sure what is causing this problem. Can someone help troubleshoot? Here is the yarn-yarn-nodemanager-datanode1.log:
2017-04-03 10:15:18,140 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(810)) - Start request for container_e10_1484675915702_18333_01_000003 by user root
2017-04-03 10:15:18,151 INFO application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding container_e10_1484675915702_18333_01_000003 to application application_1484675915702_18333
2017-04-03 10:15:18,153 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from NEW to LOCALIZING
2017-04-03 10:15:18,157 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(184)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,157 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(185)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,358 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(712)) - Created localizer for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,406 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1194)) - Writing credentials to the nmPrivate file /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens. Credentials list:
2017-04-03 10:15:18,407 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZING to LOCALIZED
2017-04-03 10:15:18,458 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZED to RUNNING
2017-04-03 10:15:18,462 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(281)) - launchContainer: [bash, /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/default_container_executor.sh]
2017-04-03 10:15:18,465 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(126)) - Copying from /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens to /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003.tokens
2017-04-03 10:15:20,998 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:21,144 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,293 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical memory used; 2.4 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,734 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(960)) - Stopping container with container Id: container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,734 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from RUNNING to KILLING
2017-04-03 10:15:24,734 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(425)) - Cleaning up container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,743 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e10_1484675915702_18333_01_000003 is : 143
2017-04-03 10:15:24,756 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/3/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2017-04-03 10:15:24,757 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_e10_1484675915702_18333_01_000003 from application application_1484675915702_18333
2017-04-03 10:15:24,757 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(512)) - Considering container container_e10_1484675915702_18333_01_000003 for log-aggregation
2017-04-03 10:15:24,758 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(190)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,758 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(191)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:26,338 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(553)) - Removed completed containers from NM context: [container_e10_1484675915702_18333_01_000003]
2017-04-03 10:15:27,294 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(390)) - Stopping resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:34,491 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(567)) - Uploading logs for container container_e10_1484675915702_18333_01_000003. Current good log dirs are /grid/1/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/3/hadoop/yarn/log,/grid/0/hadoop/yarn/log
2017-04-03 10:15:34,495 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/syslog
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/directory.info
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stdout
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stderr
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/launch_container.sh
Created 04-11-2017 08:22 PM
Exit code 143 means the container was killed (143 = 128 + 15, i.e. SIGTERM), which in practice is usually related to memory/GC issues. Your default mapper/reducer memory settings may not be sufficient to run a large data set, so try setting higher AM, map, and reduce memory when a large YARN job is invoked, for example as sketched below.
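A minimal sketch of overriding these per job on the command line, assuming your driver uses ToolRunner so generic -D options are honored; my-job.jar and MyJobClass are placeholders and the sizes are illustrative (keep each JVM -Xmx at roughly 80% of its container's memory):

# Request bigger AM, map, and reduce containers for one job run
hadoop jar my-job.jar MyJobClass \
  -Dyarn.app.mapreduce.am.resource.mb=4096 \
  -Dyarn.app.mapreduce.am.command-opts=-Xmx3276m \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6553m \
  <your job arguments>

If every large job needs the bigger containers, the same properties can instead be set cluster-wide in mapred-site.xml through Ambari.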
Created 05-11-2018 12:01 PM
When I try to execute a MapReduce program, I get errors like "Timed out after 600 secs", "Container killed by ApplicationMaster.", and "Container killed on request. Exit code is 143". The job also shows map at 100% while reduce is stuck at 72%.
Can you please help?
Created 07-12-2018 08:40 PM
Are you running a Distributed Shell application? The default client timeout for DShell is 600 secs. You can extend the client timeout using "-timeout <milliseconds>" in the application launch command, as in the sketch below.
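For example, a sketch of launching DShell with a 30-minute timeout; the jar path shown is the usual HDP location but may differ on your cluster, and the shell command is a placeholder:

# Run a script under DShell with the client timeout raised to 30 minutes
yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar \
  -shell_command "/path/to/my_script.sh" \
  -timeout 1800000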
Created 07-13-2018 08:54 PM
Exit code 143 can occur for multiple reasons. Yesterday I got the error in a Sqoop job, and it turned out to be a task timeout; adding -Dmapreduce.task.timeout=0 (which disables the timeout) to my Sqoop command resolved the issue. The failing job logged:
18/07/12 06:40:28 INFO mapreduce.Job: Job job_1530133778859_8931 running in uber mode : false
18/07/12 06:40:28 INFO mapreduce.Job: map 0% reduce 0%
18/07/12 06:45:57 INFO mapreduce.Job: Task Id : attempt_1530133778859_8931_m_000005_0, Status : FAILED
AttemptID:attempt_1530133778859_8931_m_000005_0 Timed out after 300 secs
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
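For reference, a minimal sketch of passing that property to Sqoop; the connection string, table, and target directory are placeholders, and note that generic -D options must come immediately after the tool name:

# Disable the MapReduce task timeout for this import
sqoop import -Dmapreduce.task.timeout=0 \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --table my_table \
  --target-dir /user/root/my_table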