
Exception from container-launch MapReduce Application failed

Contributor

Hello!

I'm facing an error caused by an "Exception from container-launch".

It happens with the same application every time: a resource-intensive job that fails randomly.

I copied the logs below from the history server web interface, the ResourceManager, and the NodeManager.

  

User:

my_user

Name:

my_app_name

Application Type:

MAPREDUCE

Application Tags:

 

State:

FAILED

FinalStatus:

FAILED

Started:

Sat May 27 04:00:19 +0000 2017

Elapsed:

12hrs, 29mins, 43sec

Tracking URL:

History

Diagnostics:

Application application_3091752970321_0018 failed 2 times due to AM Container for appattempt_3091752970321_0018_000002 exited with exitCode: 255

For more detailed output, check application tracking page:http://my_active_name_node:8088/proxy/application_3091752970321_0018/Then, click on links to logs of each attempt.

Diagnostics: Exception from container-launch.

Container id: container_3091752970321_0018_02_000001

Exit code: 255

Stack trace: ExitCodeException exitCode=255:

at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)

at org.apache.hadoop.util.Shell.run(Shell.java:478)

at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)

at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367)

at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)

at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1

main : run as user is my_user

main : requested yarn user is my_user

Writing to tmp file /storage/vol/06/yarn/local/nmPrivate/application_3091752970321_0018/container_3091752970321_0018_02_000001/container_3091752970321_0018_02_000001.pid.tmp

Container exited with a non-zero exit code 255

Failing this attempt. Failing the application.

 

LOGS on resourcemanager

2017-05-27 16:30:03,637 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_3091752970321_0018_000002

org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxrwxr-x

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169)

        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:867)

        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:322)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:603)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)

        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)

        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3084)

        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:3049)

        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:957)

        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:953)

        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:953)

        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:946)

        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1861)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.writeFlagFileForFailedAM(RMAppImpl.java:1351)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$3500(RMAppImpl.java:115)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AttemptFailedFinalStateSavedTransition.transition(RMAppImpl.java:1035)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AttemptFailedFinalStateSavedTransition.transition(RMAppImpl.java:1027)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1016)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1010)

        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)

        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)

        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)

        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:780)

        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:114)

        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:787)

        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:771)

        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)

        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)

        at java.lang.Thread.run(Thread.java:745)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxrwxr-x

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242)

        at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169)

        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:867)

        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:322)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:603)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

        at org.apache.hadoop.ipc.Client.call(Client.java:1471)

        at org.apache.hadoop.ipc.Client.call(Client.java:1408)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)

        at com.sun.proxy.$Proxy84.mkdirs(Unknown Source)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:544)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)

        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)

        at com.sun.proxy.$Proxy85.mkdirs(Unknown Source)

        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3082)

        ... 24 more

2017-05-27 16:30:03,653 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=my_user OPERATION=Application Finished - Failed TARGET=RMAppManager     RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application application_3091752970321_0018 failed 2 times due to AM Container for appattempt_3091752970321_0018_000002 exited with  exitCode: 255

For more detailed output, check application tracking page:http://my_active_name_node:8088/proxy/application_3091752970321_0018/Then, click on links to logs of each attempt.

Diagnostics: Exception from container-launch.

Container id: container_3091752970321_0018_02_000001

Exit code: 255

Stack trace: ExitCodeException exitCode=255:

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)

        at org.apache.hadoop.util.Shell.run(Shell.java:478)

        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)

        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

        at java.util.concurrent.FutureTask.run(FutureTask.java:262)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

        at java.lang.Thread.run(Thread.java:745)

 

Shell output: main : command provided 1

main : run as user is my_user

main : requested yarn user is my_user

Writing to tmp file /storage/vol/06/yarn/local/nmPrivate/application_3091752970321_0018/container_3091752970321_0018_02_000001/container_3091752970321_0018_02_000001.pid.tmp

 

 

Container exited with a non-zero exit code 255

Failing this attempt. Failing the application.  APPID=application_3091752970321_0018

2017-05-27 16:30:03,653 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_3091752970321_0018,name=my_app_name,user=my_user,queue=root.my_user,state=FAILED,trackingUrl=http://my_active_name_node:8088/cluster/app/application_3091752970321_0018,appMasterHost=N/A,startTi...

 

LOGS on nodemanager

 

2017-05-27 16:28:55,543 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1495752970061_0114_01_000411

2017-05-27 16:30:02,143 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_000001 is : 255

2017-05-27 16:30:02,143 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_3091752970321_0018_02_000001 and exit code: 255

ExitCodeException exitCode=255:

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)

        at org.apache.hadoop.util.Shell.run(Shell.java:478)

        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)

        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

        at java.util.concurrent.FutureTask.run(FutureTask.java:262)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

        at java.lang.Thread.run(Thread.java:745)

2017-05-27 16:30:02,144 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 255

2017-05-27 16:30:02,189 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=my_user     OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    APPID=application_3091752970321_0018    CONTAINERID=container_3091752970321_0018_02_000001

2017-05-27 16:30:02,189 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_000001

2017-05-27 16:30:02,857 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_3091752970321_0018_000002 (auth:SIMPLE)

2017-05-27 16:30:02,866 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_3091752970321_0018_000002 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB

2017-05-27 16:30:02,900 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019321 is : 143

2017-05-27 16:30:02,955 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019321

2017-05-27 16:30:03,929 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019064 is : 143

2017-05-27 16:30:04,015 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1495752970061_0114_000001 (auth:SIMPLE)

2017-05-27 16:30:04,018 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1495752970061_0114_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB

2017-05-27 16:30:04,094 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019136 is : 143

2017-05-27 16:30:04,205 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019173 is : 143

2017-05-27 16:30:04,337 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019210 is : 143

2017-05-27 16:30:04,417 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019247 is : 143

2017-05-27 16:30:04,562 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019284 is : 143

2017-05-27 16:30:04,604 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019358 is : 143

2017-05-27 16:30:04,636 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019432 is : 143

2017-05-27 16:30:04,675 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019469 is : 143

2017-05-27 16:30:04,721 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019540 is : 143

2017-05-27 16:30:04,767 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019610 is : 143

2017-05-27 16:30:04,811 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019645 is : 143

2017-05-27 16:30:04,851 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019679 is : 143

2017-05-27 16:30:04,974 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019715 is : 143

2017-05-27 16:30:05,086 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1495752970061_0114_01_000412

2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019064

2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019136

2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019173

2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019210

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019247

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019284

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019358

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019432

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019469

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019540

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019610

2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019645

2017-05-27 16:30:05,090 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019679

2017-05-27 16:30:05,090 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019715

2017-05-27 16:30:05,092 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_3091752970321_0018

2017-05-27 16:30:05,092 ERROR org.apache.spark.network.yarn.YarnShuffleService: Exception when stopping application application_3091752970321_0018

java.lang.NullPointerException

        at org.apache.spark.network.yarn.YarnShuffleService.stopApplication(YarnShuffleService.java:174)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:215)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:49)

        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)

        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)

        at java.lang.Thread.run(Thread.java:745)

 

Thanks in advance!

4 Replies

Champion
You have two exit codes: 143 and 255. I have never seen the latter, but based on the exception and messages I think the ResourceManager is trying to write out some log info after the failure and failing at that as well.

To be clearer: the job failed, and then the RM tried to write a flag file under /tmp/hadoop-yarn in HDFS but was denied. The permissions on that directory (hdfs:supergroup, drwxrwxr-x) do not give the yarn account write access. The yarn account should be part of the hadoop group (run 'id yarn' to confirm). You should then be able to fix it with something like 'sudo -u hdfs hadoop fs -chown -R hdfs:hadoop /tmp/hadoop-yarn'. I am not positive, but that directory should only be used by processes run by YARN, so it should be safe to just give yarn ownership over it as well.

Now, on to the job failure. All we have is the exit code for each of the task containers, which is 143. You would need to access the container logs for each to get more specific information. Generally, though, that exit code indicates the container was killed, often after an out-of-memory event: either the container exceeded its physical memory limit, exceeded its virtual memory limit, or exhausted its Java heap.

Can you provide the following settings for the job?

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
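(For anyone else hitting this: the four properties above are standard MapReduce job settings. They can be set cluster-wide in mapred-site.xml or overridden per job with -D on the command line. A minimal mapred-site.xml sketch with placeholder values, not recommendations:)

```xml
<!-- Sketch only: placeholder values, tune to your cluster and job -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6144m</value>
</property>
```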

Contributor

Thanks @mbigelow!

Sure I can.

 

mapreduce.map.memory.mb = 6144
mapreduce.reduce.memory.mb = 12288
mapreduce.map.java.opts = -Xmx4300m (this value comes from mapreduce.map.memory.mb * 0.7)
mapreduce.reduce.java.opts = -Xmx8601m (this value comes from mapreduce.reduce.memory.mb * 0.7)
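(The 0.7 heap-to-container ratio above is a common rule of thumb, not a hard requirement; a quick shell-arithmetic check of the values:)

```shell
# Sketch: derive -Xmx from a YARN container size using the ~0.7 rule of thumb
map_container_mb=6144
reduce_container_mb=12288

map_heap_mb=$(( map_container_mb * 7 / 10 ))        # 4300
reduce_heap_mb=$(( reduce_container_mb * 7 / 10 ))  # 8601

echo "mapreduce.map.java.opts=-Xmx${map_heap_mb}m"
echo "mapreduce.reduce.java.opts=-Xmx${reduce_heap_mb}m"
```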

 

I think it is an OOME. I created another post with more data that confirms this; you can find it at http://community.cloudera.com/t5/Batch-Processing-and-Workflow/MapReduce-application-failed-with-Out...

The logs there are from a different application, but the job is the same.

 

Thanks!

 

Guido.

Champion
I just replied on the other topic. Let's continue the discussion over there.

New Contributor

Kindly check the log details on the ResourceManager web UI:

 

<clusterip>:8088

Then open a terminal and check the actual problem in the logs:

 

yarn logs -applicationId <APP_ID>

   Example: APP_ID = application_1535002188113_0001

 

In my case it was showing a permission issue on the directory '/user/history', so I granted access with:

 

sudo -u hdfs hadoop fs -chmod 775 /user/history

or

sudo -u hdfs hadoop fs -chmod 777 /user/history

Then it worked.
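(The octal modes above follow standard POSIX semantics, the same bits HDFS reports in its listings. A quick local illustration, assuming a Linux-style shell; the directory name is just a temp dir, not an HDFS path:)

```shell
# Sketch: what the 775 vs 777 modes mean, shown on a local directory
d=$(mktemp -d)

chmod 775 "$d"            # drwxrwxr-x : owner and group can write, others cannot
ls -ld "$d" | cut -c1-10

chmod 777 "$d"            # drwxrwxrwx : world-writable (broader than usually needed)
ls -ld "$d" | cut -c1-10

rmdir "$d"
```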