Member since: 02-15-2017
Posts: 41
Kudos Received: 1
Solutions: 2

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 23902 | 07-21-2017 10:30 AM
 | 16332 | 07-04-2017 07:33 AM
06-16-2017
11:28 AM
@mbigelow that's right! The app used an older API from MRv1. The app processes almost the same amount of data each day, plus approx. 30 GB, so the expected runtime should only grow a little each day since the app runs daily, but over the last few days it went from 9 hours to more than a day to finish. The solution I found was to decrease the amount of memory instead of increasing it. Now the app can allocate more containers, so it is running pretty fast. Thanks for your help; I'll let you know if anything comes up. Guido.
06-16-2017
11:21 AM
Thank you Fawze, and sorry for my delayed answer; I was sick. It happens in both the map and the reduce phase. The MR job reads approx. 3 TB from HBase and adds about 30 GB of data per day.
Cluster architecture:
- Number of NodeManagers = 37
- Resource Managers = 2, with YARN in HA
- 5 nodes with 22 cores and 66 GB of RAM
- 32 nodes with 30 cores and 120 GB of RAM
- Total vcores = 1070
- Total memory = 4.07 TB
I realized that decreasing the memory instead of increasing it gave better results. With these parameters the job worked really well and could get all the cores:
"mapreduce.map.memory.mb" 1536
"mapreduce.reduce.memory.mb" 3072
"mapreduce.map.java.opts" -Xmx1024m
"mapreduce.reduce.java.opts" -Xmx2560m
Thanks! Guido.
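For reference, a minimal sketch of how those final values would look as entries in mapred-site.xml (these are the standard MR2 property names; illustrative only, not the exact file from this cluster):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2560m</value>
</property>

Rough arithmetic on why the smaller containers help: on a NodeManager offering roughly 120 GB and 30 vcores (the node type described above), 6144 MB map containers are memory-bound at about 19-20 per node, while 1536 MB map containers hit the 30-vcore limit first, so noticeably more map tasks run in parallel on each node.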
06-14-2017
01:05 PM
Hi @mbigelow, after putting the MapReduce v2 properties into mapred-site.xml and deleting them from yarn-site.xml, the jobs started to run slower and slower and can only get 1 or 2 cores and a few MB of RAM. Before the change, jobs used about 300 cores and 3811328 MB of memory. It looks like the NodeManager was reading these properties from yarn-site.xml. I know that is the wrong place for them, but could there be some strange configuration that makes YARN behave like that? It doesn't make sense to me, but maybe you've seen this before. Thanks in advance! Guido.
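One way to see exactly which XML file a given MapReduce property is being resolved from on a client/gateway node is Configuration.getPropertySources(). The sketch below is my own illustration (the class name ConfSourceCheck and the run setup are assumptions, not something from this thread); it assumes the cluster's client configuration directory is on the classpath:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ConfSourceCheck {
    public static void main(String[] args) {
        // JobConf registers mapred-default.xml / mapred-site.xml as default resources;
        // constructing a YarnConfiguration first also registers yarn-default.xml / yarn-site.xml,
        // so values that were (mis)placed in yarn-site.xml become visible as well.
        Configuration conf = new JobConf(new YarnConfiguration());
        String[] keys = {
            "mapreduce.map.memory.mb",
            "mapreduce.reduce.memory.mb",
            "mapreduce.map.java.opts",
            "mapreduce.reduce.java.opts"
        };
        for (String key : keys) {
            // getPropertySources reports the resource(s) the value came from, or null if unset.
            String[] sources = conf.getPropertySources(key);
            System.out.println(key + " = " + conf.get(key)
                + "  (source: " + (sources == null ? "default/unset" : Arrays.toString(sources)) + ")");
        }
    }
}

Compiled against the Hadoop client jars and run with the hadoop command (hadoop CLASSNAME, with the class on HADOOP_CLASSPATH), this prints each value together with the file it was read from, which should show whether yarn-site.xml is still being picked up.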
06-14-2017
11:51 AM
Thanks @mbigelow! I hope that solves the problem. I'll let you know. Thanks again. Guido.
06-13-2017
12:56 PM
Thanks @mbigelow! Sure I can:
mapreduce.map.memory.mb = 6144
mapreduce.reduce.memory.mb = 12288
mapreduce.map.java.opts = -Xmx4300m (this value comes from mapreduce.map.memory.mb * 0.7)
mapreduce.reduce.java.opts = -Xmx8601m (this value comes from mapreduce.reduce.memory.mb * 0.7)
I think it is an OOME. I created another post with more data that confirms this; you can find it at http://community.cloudera.com/t5/Batch-Processing-and-Workflow/MapReduce-application-failed-with-OutOfMemoryError/m-p/55708 The log there is from a different application run, but the job is the same. Thanks! Guido.
06-13-2017
11:24 AM
Hello! I'm facing an error due to an "Exception from container-launch". It is happening with the same application, a high-resource-consuming job that fails randomly. I copied the logs from the History Server web interface, the ResourceManager server and the NodeManager.

User: my_user
Name: my_app_name
Application Type: MAPREDUCE
Application Tags:
State: FAILED
FinalStatus: FAILED
Started: Sat May 27 04:00:19 +0000 2017
Elapsed: 12hrs, 29mins, 43sec
Tracking URL: History
Diagnostics: Application application_3091752970321_0018 failed 2 times due to AM Container for appattempt_3091752970321_0018_000002 exited with exitCode: 255 For more detailed output, check application tracking page:http://my_active_name_node:8088/proxy/application_3091752970321_0018/Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch.
Container id: container_3091752970321_0018_02_000001
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is my_user
main : requested yarn user is my_user
Writing to tmp file /storage/vol/06/yarn/local/nmPrivate/application_3091752970321_0018/container_3091752970321_0018_02_000001/container_3091752970321_0018_02_000001.pid.tmp
Container exited with a non-zero exit code 255
Failing this attempt. Failing the application.
LOGS on resourcemanager 2017-05-27 16:30:03,637 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_3091752970321_0018_000002 org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxrwxr-x at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:867) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:322) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:603) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:3049) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:957) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:953) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:953) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:946) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1861) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.writeFlagFileForFailedAM(RMAppImpl.java:1351) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$3500(RMAppImpl.java:115) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AttemptFailedFinalStateSavedTransition.transition(RMAppImpl.java:1035) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AttemptFailedFinalStateSavedTransition.transition(RMAppImpl.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1016) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1010) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:780) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:114) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:787) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:771) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxrwxr-x at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:242) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:169) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6590) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6572) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6524) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4322) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4292) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4265) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:867) at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:322) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:603) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) at org.apache.hadoop.ipc.Client.call(Client.java:1471) at org.apache.hadoop.ipc.Client.call(Client.java:1408) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy84.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:544) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy85.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3082) ... 24 more 2017-05-27 16:30:03,653 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=my_user OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_3091752970321_0018 failed 2 times due to AM Container for appattempt_3091752970321_0018_000002 exited with exitCode: 255 For more detailed output, check application tracking page:http://my_active_name_node:8088/proxy/application_3091752970321_0018/Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. 
Container id: container_3091752970321_0018_02_000001 Exit code: 255 Stack trace: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) at org.apache.hadoop.util.Shell.run(Shell.java:478) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Shell output: main : command provided 1 main : run as user is my_user main : requested yarn user is my_user Writing to tmp file /storage/vol/06/yarn/local/nmPrivate/application_3091752970321_0018/container_3091752970321_0018_02_000001/container_3091752970321_0018_02_000001.pid.tmp Container exited with a non-zero exit code 255 Failing this attempt. Failing the application. APPID=application_3091752970321_0018 2017-05-27 16:30:03,653 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_3091752970321_0018,name=my_app_name,user=my_user,queue=root.my_user,state=FAILED,trackingUrl=http://my_active_name_node:8088/cluster/app/application_3091752970321_0018,appMasterHost=N/A,startTime=1495857619484,finishTime=1495902602847,finalStatus=FAILED LOGS on nodemanager 2017-05-27 16:28:55,543 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1495752970061_0114_01_000411 2017-05-27 16:30:02,143 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_000001 is : 255 2017-05-27 16:30:02,143 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_3091752970321_0018_02_000001 and exit code: 255 ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) at org.apache.hadoop.util.Shell.run(Shell.java:478) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:367) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2017-05-27 16:30:02,144 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 255 2017-05-27 16:30:02,189 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=my_user OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_3091752970321_0018 
CONTAINERID=container_3091752970321_0018_02_000001 2017-05-27 16:30:02,189 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_000001 2017-05-27 16:30:02,857 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_3091752970321_0018_000002 (auth:SIMPLE) 2017-05-27 16:30:02,866 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_3091752970321_0018_000002 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB 2017-05-27 16:30:02,900 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019321 is : 143 2017-05-27 16:30:02,955 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019321 2017-05-27 16:30:03,929 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019064 is : 143 2017-05-27 16:30:04,015 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1495752970061_0114_000001 (auth:SIMPLE) 2017-05-27 16:30:04,018 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1495752970061_0114_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB 2017-05-27 16:30:04,094 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019136 is : 143 2017-05-27 16:30:04,205 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019173 is : 143 2017-05-27 16:30:04,337 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019210 is : 143 2017-05-27 16:30:04,417 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019247 is : 143 2017-05-27 16:30:04,562 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019284 is : 143 2017-05-27 16:30:04,604 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019358 is : 143 2017-05-27 16:30:04,636 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019432 is : 143 2017-05-27 16:30:04,675 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019469 is : 143 2017-05-27 16:30:04,721 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019540 is : 143 2017-05-27 16:30:04,767 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019610 is : 143 2017-05-27 16:30:04,811 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019645 is : 143 2017-05-27 16:30:04,851 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019679 is : 143 2017-05-27 16:30:04,974 WARN 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_3091752970321_0018_02_019715 is : 143 2017-05-27 16:30:05,086 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1495752970061_0114_01_000412 2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019064 2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019136 2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019173 2017-05-27 16:30:05,088 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019210 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019247 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019284 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019358 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019432 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019469 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019540 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019610 2017-05-27 16:30:05,089 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019645 2017-05-27 16:30:05,090 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019679 2017-05-27 16:30:05,090 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_3091752970321_0018_02_019715 2017-05-27 16:30:05,092 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_3091752970321_0018 2017-05-27 16:30:05,092 ERROR org.apache.spark.network.yarn.YarnShuffleService: Exception when stopping application application_3091752970321_0018 java.lang.NullPointerException at org.apache.spark.network.yarn.YarnShuffleService.stopApplication(YarnShuffleService.java:174) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:215) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:49) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) Thanks in advance!
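Two things that might be worth checking from the command line here (standard YARN/HDFS CLI; the application id, user and path are the ones that appear in the logs above): pulling the aggregated container logs for the failed attempt (this assumes log aggregation is enabled), and inspecting the /tmp/hadoop-yarn directory that triggers the AccessControlException in the ResourceManager log.

# Fetch the aggregated logs for the failed application (run as the job owner, or pass -appOwner)
yarn logs -applicationId application_3091752970321_0018 -appOwner my_user > app_0018_logs.txt

# Look at the ownership/permissions of the directory the RM could not write its fail-flag file into
hdfs dfs -ls -d /tmp/hadoop-yarn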
Labels: Apache YARN, MapReduce
06-13-2017
07:56 AM
Hello! I have a cluster up and running. It is running an old CDH version (5.7.0) with 2 NameNodes and 43 DataNodes. Is it possible to install Cloudera Manager on it? Thanks! Guido.
Labels: Cloudera Manager
06-12-2017
08:25 AM
Hello! Since my cluster started to run more Spark and MapReduce jobs than before, a high-resource-consuming MAPREDUCE application has started to crash randomly. The error is always the same (OOME). My cluster nodes have 128 GB of RAM and 32 cores each; Kerberos and HA for HDFS and YARN are enabled (we use YARN instead of MapReduce v1).

App details:
- Total maps: 18922
- Total reducers: 983
- Approx. Allocated CPU vCores: 283
- Approx. Allocated Memory MB: 3467264

Below I pasted:
- The log from my History Server.
- The properties related to memory configuration from my yarn-site.xml, mapred-site.xml and hadoop-env.sh.
- Java info per node.
-----------------------------------------------------------------------------------------------------
History Server log (from one YARN NodeManager where the app failed)
Log Type: stderr
Log Upload Time: Sun Jun 11 13:49:24 +0000 2017
Log Length: 25992448
Showing 4096 bytes of 25992448 total. Click here for the full log.
v2.runtime.XMLSerializer.leafElement(XMLSerializer.java:327)
at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$StringImplImpl.writeLeafElement(RuntimeBuiltinLeafInfoImpl.java:1045)
at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$StringImplImpl.writeLeafElement(RuntimeBuiltinLeafInfoImpl.java:1024)
at com.sun.xml.bind.v2.model.impl.RuntimeEnumLeafInfoImpl.writeLeafElement(RuntimeEnumLeafInfoImpl.java:169)
at com.sun.xml.bind.v2.model.impl.RuntimeEnumLeafInfoImpl.writeLeafElement(RuntimeEnumLeafInfoImpl.java:69)
at com.sun.xml.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.writeLeafElement(TransducedAccessor.java:256)
at com.sun.xml.bind.v2.runtime.property.SingleElementLeafProperty.serializeBody(SingleElementLeafProperty.java:128)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:344)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:700)
at com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:69)
at com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:172)
at com.sun.xml.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:159)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:344)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsSoleContent(XMLSerializer.java:597)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.serializeRoot(ClassBeanInfoImpl.java:328)
at com.sun.xml.bind.v2.runtime.XMLSerializer.childAsRoot(XMLSerializer.java:498)
at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:320) ...
39 more Caused by: org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at com.sun.jersey.spi.container.servlet.WebComponent$Writer.write(WebComponent.java:307) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:146) at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207) at org.codehaus.jackson.impl.WriterBasedGenerator._flushBuffer(WriterBasedGenerator.java:1812) at org.codehaus.jackson.impl.WriterBasedGenerator._writeString(WriterBasedGenerator.java:987) at org.codehaus.jackson.impl.WriterBasedGenerator.writeString(WriterBasedGenerator.java:448) at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.flushPreviousString(JacksonStringMergingGenerator.java:311) at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.writeFieldName(JacksonStringMergingGenerator.java:139) at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.writeStartElement(Stax2JacksonWriter.java:183) ... 58 more Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492) at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:171) at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221) at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725) ... 77 more Halting due to Out Of Memory Error... Log Type: stdout Log Upload Time: Sun Jun 11 13:49:24 +0000 2017 Log Length: 0 Log Type: syslog Log Upload Time: Sun Jun 11 13:49:24 +0000 2017 Log Length: 24905123 Showing 4096 bytes of 24905123 total. Click here for the full log. 
che.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,078 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,079 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,080 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,080 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,081 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1496754929367_0270 (auth:SIMPLE) 2017-06-11 07:00:14,081 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,082 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,082 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,082 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1496754929367_0270 (auth:SIMPLE) 2017-06-11 07:00:14,083 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,083 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for job_1496754929367_0270 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol 2017-06-11 07:00:14,974 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Socket Reader #1 for port 36503,5,main] threw an Error. Shutting down now... 
java.lang.OutOfMemoryError: GC overhead limit exceeded at com.google.protobuf.CodedInputStream.<init>(CodedInputStream.java:573) at com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:55) at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:219) at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:912) at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267) at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:290) at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:926) at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:296) at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:918) at org.apache.hadoop.ipc.Server$Connection.decodeProtobufFromStream(Server.java:1994) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1774) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1548) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:774) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:647) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:618) 2017-06-11 07:00:14,980 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException ----------------------------------------------------------------------------------------------------- yarn-site.xml <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>122635</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>30</value> </property> <property> <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name> <value>100.0</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>102400</value> </property> <property> <name>mapreduce.map.memory.mb</name> <value>6144</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>12288</value> </property> <property> <name>local.child.java.opts</name> <value>-server -Djava.net.preferIPv4Stack=true</value> </property> <property> <name>local.map.task.jvm.heap.mb</name> <value>4300</value> </property> <property> <name>local.reduce.task.jvm.heap.mb</name> <value>8601</value> </property> <property> <name>mapreduce.map.java.opts</name> <value>${local.child.java.opts} -Xmx${local.map.task.jvm.heap.mb}m</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>${local.child.java.opts} -Xmx${local.reduce.task.jvm.heap.mb}m</value> </property> ----------------------------------------------------------------------------------------------------- mapred-site.xml <property> <name>local.child.java.opts</name> <value>-server -Djava.net.preferIPv4Stack=true</value> </property> <property> <name>local.map.task.jvm.heap.mb</name> <value>4300</value> </property> <property> <name>local.reduce.task.jvm.heap.mb</name> <value>8601</value> </property> <property> <name>mapred.map.child.java.opts</name> <value>${local.child.java.opts} -Xmx${local.map.task.jvm.heap.mb}m</value> </property> <property> <name>mapred.reduce.child.java.opts</name> <value>${local.child.java.opts} -Xmx${local.reduce.task.jvm.heap.mb}m</value> </property> <property> <name>mapred.job.map.memory.mb</name> <value>6144</value> </property> <property> <name>mapred.job.reduce.memory.mb</name> <value>12288</value> </property> <property> 
<name>mapred.cluster.max.map.memory.mb</name> <value>15360</value> </property> <property> <name>mapred.cluster.max.reduce.memory.mb</name> <value>15360</value> </property> <property> <name>mapreduce.job.reduce.slowstart.completedmaps</name> <value>0.999</value> </property> hadoop-env.sh # Generic command-specific options add_opt HADOOP_NAMENODE_OPTS "-XX:+UseParNewGC" add_opt HADOOP_NAMENODE_OPTS "-XX:+UseConcMarkSweepGC" add_opt HADOOP_NAMENODE_OPTS "-Xmx8192m" #" add_opt HADOOP_NAMENODE_OPTS "-Dsyslog.tag=namenode" add_opt HADOOP_NAMENODE_OPTS "-Ddaemon.logger=INFO,syslog" add_opt HADOOP_NAMENODE_OPTS "${JMX_OPTS}=7191" add_opt HADOOP_DATANODE_OPTS "-XX:+UseParNewGC" add_opt HADOOP_DATANODE_OPTS "-XX:+UseConcMarkSweepGC" add_opt HADOOP_DATANODE_OPTS "-Xmx4096m" #" add_opt HADOOP_DATANODE_OPTS "-Dsyslog.tag=datanode" add_opt HADOOP_DATANODE_OPTS "-Ddaemon.logger=INFO,syslog" add_opt HADOOP_DATANODE_OPTS "${JMX_OPTS}=7194" add_opt HADOOP_JOURNALNODE_OPTS "-XX:+UseParNewGC" add_opt HADOOP_JOURNALNODE_OPTS "-XX:+UseConcMarkSweepGC" add_opt HADOOP_JOURNALNODE_OPTS "-Xmx1024m" #" add_opt HADOOP_JOURNALNODE_OPTS "-Dsyslog.tag=journalnode" add_opt HADOOP_JOURNALNODE_OPTS "-Ddaemon.logger=INFO,syslog" add_opt HADOOP_JOURNALNODE_OPTS "${JMX_OPTS}=7203" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "-XX:+UseParNewGC" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "-XX:+UseConcMarkSweepGC" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "-Xmx1024m" #" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "-Dsyslog.tag=historyserver" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "-Ddaemon.logger=INFO,syslog" add_opt HADOOP_JOB_HISTORYSERVER_OPTS "${JMX_OPTS}=7201" add_opt YARN_RESOURCEMANAGER_OPTS "-XX:+UseParNewGC" add_opt YARN_RESOURCEMANAGER_OPTS "-XX:+UseConcMarkSweepGC" add_opt YARN_RESOURCEMANAGER_OPTS "-Xmx1024m" #" add_opt YARN_RESOURCEMANAGER_OPTS "-Dsyslog.tag=resourcemanager" add_opt YARN_RESOURCEMANAGER_OPTS "-Ddaemon.logger=INFO,syslog" add_opt YARN_RESOURCEMANAGER_OPTS "${JMX_OPTS}=7204" add_opt YARN_PROXYSERVER_OPTS "-XX:+UseParNewGC" add_opt YARN_PROXYSERVER_OPTS "-XX:+UseConcMarkSweepGC" add_opt YARN_PROXYSERVER_OPTS "-Xmx1024m" #" add_opt YARN_PROXYSERVER_OPTS "-Dsyslog.tag=proxyserver" add_opt YARN_PROXYSERVER_OPTS "-Ddaemon.logger=INFO,syslog" add_opt YARN_PROXYSERVER_OPTS "${JMX_OPTS}=7202" add_opt YARN_NODEMANAGER_OPTS "-XX:+UseParNewGC" add_opt YARN_NODEMANAGER_OPTS "-XX:+UseConcMarkSweepGC" add_opt YARN_NODEMANAGER_OPTS "-Xmx1024m" #" add_opt YARN_NODEMANAGER_OPTS "-Dsyslog.tag=nodemanager" add_opt YARN_NODEMANAGER_OPTS "-Ddaemon.logger=INFO,syslog" add_opt YARN_NODEMANAGER_OPTS "${JMX_OPTS}=7205" # Specific command-specific options add_opt HADOOP_NAMENODE_OPTS "-Dhdfs.audit.logger=INFO,RFAAUDIT" add_opt HADOOP_JOBTRACKER_OPTS "-Dmapred.audit.logger=INFO,MRAUDIT" add_opt HADOOP_JOBTRACKER_OPTS "-Dmapred.jobsummary.logger=INFO,JSA" add_opt HADOOP_TASKTRACKER_OPTS "-Dsecurity.audit.logger=ERROR,console" add_opt HADOOP_TASKTRACKER_OPTS "-Dmapred.audit.logger=ERROR,console" add_opt HADOOP_SECONDARYNAMENODE_OPTS "-Dhdfs.audit.logger=INFO,RFAAUDIT" IMPORTANT: There's a file used for the java mapred application where the following properties are set: "mapreduce.reduce.slowstart.completed.maps" 0.95, "mapreduce.job.reduces" 983, "mapreduce.reduce.shuffle.input.buffer.percent" 0.5, ----------------------------------------------------------------------------------------------------- Some JAVA settings: VM Memory Heap results from "$ java -XshowSettings:all" VM settings: Max. 
Heap Size (Estimated): 26.52G
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
Details of the Java processes from one randomly selected node (at the moment this node is running reduce tasks for the same app that failed before):
- Number of processes = 12
- Memory usage per process = 5493.24 MB
- Total memory usage = 65918.9 MB
From running ps -aux | grep app_id I got: ........-Xmx8601m.......
-----------------------------------------------------------------------------------------------------
If you need more details please let me know. Thanks! Guido.
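The syslog excerpt above appears to come from the MRAppMaster container itself (it is the process serving TaskUmbilicalProtocol, and its IPC "Socket Reader" thread is what hit the GC overhead limit), so one knob that may be worth checking for a job with roughly 19,000 map tasks is the ApplicationMaster container size and heap. These are the standard MR2 property names; the values below are only an illustrative sketch (following the same heap-to-container ratio discussed earlier), not a tuned recommendation for this cluster:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx3276m</value>
</property>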
Labels: Apache YARN, MapReduce
02-20-2017
06:01 AM
Finally I got my cluster up and running! As @mbigelow said, two of my three JournalNodes were up and running but were declared incorrectly in the dfs.namenode.shared.edits.dir property in hdfs-site.xml. After changing it, the NameNode service started! Now everything appears to be in order. I hope my problem helps someone in this community. Thanks @saranvisa and @mbigelow!
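For anyone hitting the same thing: dfs.namenode.shared.edits.dir has to list every JournalNode inside a single qjournal URI. A minimal sketch of the expected shape (the hostnames and the nameservice id at the end are placeholders, not the values from my cluster):

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>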