Created 12-28-2016 11:04 AM
I'm facing frequent container exceptions on one particular cluster, but the same set of jobs runs fine on another cluster.
Exception from container-launch.
Container id: container_e64_1481762217559_27152_01_000002
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is xyz
main : requested yarn user is xyz
Created 12-28-2016 11:41 AM
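Exit code 127 usually points to an issue specific to the user application rather than to YARN itself. By shell convention, 127 means "command not found", so check that every command and path the container launch relies on exists on the failing nodes.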
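To make the meaning of exit code 127 concrete, here is a minimal sketch that reproduces it outside of YARN. It assumes a POSIX `sh` is available; the command name is hypothetical and chosen only because it should not exist on any PATH. It does not show what YARN runs internally, only how a shell reports a missing command.

```python
import subprocess

# Hypothetical command name, picked so that it is not found on PATH.
MISSING_COMMAND = "no_such_command_for_exit_127_demo"


def run_in_shell(command: str) -> int:
    """Run `command` through sh and return the shell's exit status."""
    completed = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    # A command the shell cannot find yields exit status 127.
    print("missing command:", run_in_shell(MISSING_COMMAND))
    # A command that exists and succeeds yields 0.
    print("true:", run_in_shell("true"))
```

If the same check returns 127 for a binary on one cluster's nodes and 0 on the other's, that difference in the node environment is a likely place to look.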
Created 12-28-2016 11:51 AM
Also, please share any application-specific exceptions you are seeing apart from this one.
Created 12-29-2016 06:08 AM
2016-12-28 08:28:12,219 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1481762217559_27152_m_000000_1: Exception from container-launch.
Container id: container_e64_1481762217559_27152_01_000007
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Created 12-29-2016 06:15 AM
I am seeing a few thread messages too:
INFO [Thread-55] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node xyz_123.com
2016-12-28 08:28:34,579 INFO [Thread-77] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2016-12-28 08:28:34,582 INFO [Thread-77] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Setting job diagnostics to Task failed task_1481762217559_27152_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
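Since the jobs fail on only one cluster, it can help to see whether the failures cluster on particular nodes. The sketch below tallies the "failures on node" messages from an ApplicationMaster log. It assumes the message format shown above and that the number is a running count per node, so it keeps the highest value seen; the function name is made up for this illustration.

```python
import re

# Matches the RMContainerRequestor message format seen in the AM log above.
FAILURE_RE = re.compile(r"RMContainerRequestor: (\d+) failures on node (\S+)")


def failures_per_node(log_lines):
    """Return {node: highest failure count reported for that node}."""
    counts = {}
    for line in log_lines:
        for match in FAILURE_RE.finditer(line):
            count, node = int(match.group(1)), match.group(2)
            # Assumption: the logged number is cumulative, so keep the maximum.
            counts[node] = max(counts.get(node, 0), count)
    return counts


if __name__ == "__main__":
    sample = [
        "INFO [Thread-55] org.apache.hadoop.mapreduce.v2.app.rm."
        "RMContainerRequestor: 1 failures on node xyz_123.com",
    ]
    for node, count in sorted(failures_per_node(sample).items()):
        print(node, count)
```

If a handful of hosts account for most of the failures, comparing their environment against a healthy node is a reasonable next step.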
Created 12-29-2016 06:41 AM
I am also seeing the messages below from the NodeManager:
2016-12-28 08:28:26,005 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1481762217559_27152
2016-12-28 08:28:26,005 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(189)) - Stopping container container_e64_1481762217559_27152_01_000008
2016-12-28 08:28:26,481 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1481762217559_27152_000001 (auth:SIMPLE)
2016-12-28 08:28:26,491 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1481762217559_27152_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-12-28 08:28:26,491 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e64_1481762217559_27152_01_000008
2016-12-28 08:28:26,492 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=ajamwal IP=10.246.73.94 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1481762217559_27152 CONTAINERID=container_e64_1481762217559_27152_01_000008
2016-12-28 08:28:26,817 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(674)) - Unknown localizer with localizerId container_e64_1481762217559_27152_01_000008 is sending heartbeat. Ordering it to DIE
2016-12-28 08:28:26,818 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(674)) - Unknown localizer with localizerId container_e64_1481762217559_27152_01_000008 is sending heartbeat. Ordering it to DIE
2016-12-28 08:28:27,227 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1131)) - Localizer failed
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:258)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
2016-12-28 08:28:28,016 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_e64_1481762217559_27152_01_000008]
Created 12-29-2016 06:50 AM
Exception from container-launch.
Container id: container_e64_1481762217559_27152_01_000002
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is ajamwal
main : requested yarn user is ajamwal
Container exited with a non-zero exit code 127