Created on 08-26-2018 09:35 AM - edited 09-16-2022 06:38 AM
Hi,
I have a 3-node Hadoop cluster running CDH 5.10.0 with Java version 1.8.0_171.
When I start the services, they all start fine. But after 3-4 minutes, the health of every NodeManager turns bad with unexpected exits, and soon after that the ResourceManager also stops working. Once the ResourceManager has stopped completely, all the NodeManagers show good health again, but the ResourceManager remains in the stopped state.
Below are a few log excerpts:
NodeManager log:
Unable to recover container container_1535300340310_0001_01_000001
java.io.IOException: Timeout while waiting for exit code from container_1535300340310_0001_01_000001
at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
ResourceManager log:
2018-08-05 13:56:29,100 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 1: SIGHUP
2018-08-05 13:56:29,131 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-08-05 13:56:29,136 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@ip-10-0-0-6.ec2.internal:8088
2018-08-05 13:56:29,137 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-08-05 13:56:29,145 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-08-05 13:56:29,149 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2018-08-05 13:56:29,157 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2018-08-05 13:56:29,159 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-05 13:56:29,162 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2018-08-05 13:56:29,163 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2018-08-05 13:56:29,163 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-05 13:56:29,165 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state
2018-08-05 13:56:29,166 WARN org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher: org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
2018-08-05 13:56:29,169 INFO org.apache.hadoop.ipc.Server: Stopping server on 8030
2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1535300340310_0010_02_000001 Container Transitioned from RUNNING to COMPLETED
2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1535300340310_0010_02_000001 in state: COMPLETED event:FINISHED
2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dr.who OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1535300340310_0010 CONTAINERID=container_1535300340310_0010_02_000001
2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1535300340310_0010_02_000001 of capacity <memory:1024, vCores:1> on host ip-10-0-0-6.ec2.internal:8041, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:2262, vCores:3> available, release resources=true
2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1535300340310_0010_000002 released container container_1535300340310_0010_02_000001 on node: host: ip-10-0-0-6.ec2.internal:8041 #containers=1 available=2262 used=1024 with event: FINISHED
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1535300340310_0010_000002 with final state: FAILED, and exit status: 0
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1535300340310_0010_000002 State change from LAUNCHED to FINAL_SAVING
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1535300340310_0010_000002
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1535300340310_0010_000002
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1535300340310_0010_000002 State change from FINAL_SAVING to FAILED
2018-08-26 12:20:40,708 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 2. The max attempts is 2
2018-08-26 12:20:40,709 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1535300340310_0010 with final state: FAILED
2018-08-26 12:20:40,709 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1535300340310_0010 State change from ACCEPTED to FINAL_SAVING
2018-08-26 12:20:40,709 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1535300340310_0010
2018-08-26 12:20:40,709 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1535300340310_0010_000002 is done. finalState=FAILED
2018-08-26 12:20:40,709 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1535300340310_0010 requests cleared
2018-08-26 12:20:40,765 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1535300340310_0010 failed 2 times due to AM Container for appattempt_1535300340310_0010_000002 exited with exitCode: 0
For more detailed output, check application tracking page:http://ip-10-0-0-6.ec2.internal:8088/proxy/application_1535300340310_0010/Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the application.
2018-08-26 12:20:40,765 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1535300340310_0010 State change from FINAL_SAVING to FAILED
2018-08-26 12:20:40,766 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dr.who OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1535300340310_0010 failed 2 times due to AM Container for appattempt_1535300340310_0010_000002 exited with exitCode: 0
For more detailed output, check application tracking page:http://ip-10-0-0-6.ec2.internal:8088/proxy/application_1535300340310_0010/Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the
Diagnostics: Failing this attempt. Failing the application.
2018-08-26 12:20:42,322 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1535300340310_0014 State change from FINAL_SAVING to FAILED
2018-08-26 12:20:42,322 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dr.who OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1535300340310_0014 failed 2 times due to AM Container for appattempt_1535300340310_0014_000002 exited with exitCode: 0
For more detailed output, check application tracking page:http://ip-10-0-0-6.ec2.internal:8088/proxy/application_1535300340310_0014/Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the application. APPID=application_1535300340310_0014
2018-08-26 12:20:42,322 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1535300340310_0014,name=hadoop,user=dr.who,queue=root.users.dr_dot_who,state=FAILED,trackingUrl=http://ip-10-0-0-6.ec2.internal:8088/cluster/app/application_1535300340310_0014,appMasterHost=N/A,st...
2018-08-26 12:20:42,584 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1535300340310_0013_02_000001 Container Transitioned from ACQUIRED to RUNNING
2018-08-26 12:20:42,595 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Making reservation: node=ip-10-0-0-10.ec2.internal app_id=application_1535300340310_0015
2018-08-26 12:20:42,595 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1535300340310_0015_02_000001 Container Transitioned from NEW to RESERVED
2018-08-26 12:20:42,595 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Reserved container container_1535300340310_0015_02_000001 on node host: ip-10-0-0-10.ec2.internal:8041 #containers=3 available=638 used=3072 for application application_1535300340310_0015
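To see at a glance which users' applications are failing in a log like the one above, a small Python sketch along these lines could tally the RMAuditLogger entries that report RESULT=FAILURE. The function name, the regular expression, and the shortened sample lines are illustrative assumptions and not part of any Hadoop tooling; in practice the lines would be read from the ResourceManager log file.

```python
import re
from collections import Counter

# Assumed pattern for RMAuditLogger entries as they appear in the log above:
#   ... RMAuditLogger: USER=<user> OPERATION=<op> TARGET=<target> RESULT=<result> ...
# Fields may be separated by spaces or tabs, so \s+ is used between them.
AUDIT_RE = re.compile(
    r"RMAuditLogger: USER=(?P<user>\S+)\s+OPERATION=(?P<op>.+?)\s+"
    r"TARGET=(?P<target>\S+)\s+RESULT=(?P<result>\S+)"
)


def count_failures_by_user(lines):
    """Count audit entries with RESULT=FAILURE, grouped by submitting user."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match and match.group("result") == "FAILURE":
            counts[match.group("user")] += 1
    return counts


# Shortened copies of three audit lines from the log above (illustration only).
SAMPLE = [
    "2018-08-26 12:20:40,707 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: "
    "USER=dr.who OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS",
    "2018-08-26 12:20:40,766 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: "
    "USER=dr.who OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE",
    "2018-08-26 12:20:42,322 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: "
    "USER=dr.who OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE",
]

failures = count_failures_by_user(SAMPLE)
for user, count in failures.most_common():
    print(f"{user}: {count} failed application(s)")
```

Running the same function over the full log file would show whether every failing application is submitted by the same user, as the excerpt suggests.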
Note:
Whenever I restart the YARN service, all the roles start without any issues, but after a few minutes the NodeManagers show bad health, and soon after that the ResourceManager goes down. Please help me understand the issue and resolve it. Thanks in advance.
Created 08-26-2018 09:46 PM
Created 08-27-2018 10:01 AM
Hi Tomas,
I am using RHEL 7.1.