Created on 12-31-2014 06:03 PM - edited 09-16-2022 02:17 AM
CDH 5.2.0-1.cdh5.2.0.p0.36
We had an issue with HDFS filling up, which caused a number of services to fail. After we cleared space and restarted the cluster, we are no longer able to run any Hive workflows through Oozie. The jobs seem to get stuck waiting for resources to be allocated.
No changes were made to the YARN resource configuration, which seems to be the first thing most troubleshooting guides point to. We have plenty of resources allocated to YARN containers, and there are currently no app limits set in the dynamic resource pools.
When I start an Oozie workflow, the oozie:launcher application starts normally, but the Hive query that it executes is always stuck in the ACCEPTED state and never transitions to RUNNING.
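For anyone who wants to check this outside the UI, the ResourceManager REST API can list applications by state (`/ws/v1/cluster/apps` with the `states` filter). Below is a minimal Python sketch. The address `rm-host:8088`, the helper names, and the sample payload are placeholders I made up for illustration, not values from this cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_address, states):
    """Build the RM REST URL that lists applications in the given states."""
    query = urlencode({"states": ",".join(states)})
    return f"http://{rm_address}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload):
    """Return (id, name, queue, state) tuples from a /ws/v1/cluster/apps response.

    The RM returns {"apps": null} when nothing matches, so guard against None.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [(a.get("id"), a.get("name"), a.get("queue"), a.get("state")) for a in apps]


def fetch_accepted_apps(rm_address):
    """Query a live ResourceManager (hypothetical address) for ACCEPTED apps."""
    with urlopen(apps_url(rm_address, ["ACCEPTED"]), timeout=10) as resp:
        return summarize_apps(json.load(resp))


# Hypothetical response body, shaped like the documented REST output.
SAMPLE_PAYLOAD = {
    "apps": {
        "app": [
            {
                "id": "application_1420073214126_0002",
                "name": "hive-query (placeholder name)",
                "queue": "default",
                "state": "ACCEPTED",
            }
        ]
    }
}

if __name__ == "__main__":
    for row in summarize_apps(SAMPLE_PAYLOAD):
        print(row)
```

Against a real cluster you would call `fetch_accepted_apps("rm-host:8088")` with your own ResourceManager web address. An application that stays in the result across repeated calls is the stuck one.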
The oozie:launcher application is accepted and scheduled.
2015-01-01 00:47:48,472 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0001 from user: admin, in queue: default, currently num of applications: 1
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0001 State change from SUBMITTED to ACCEPTED
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0001_000001
2015-01-01 00:47:48,476 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from NEW to SUBMITTED
2015-01-01 00:47:48,490 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0001_000001 to scheduler from user: admin
2015-01-01 00:47:48,492 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SUBMITTED to SCHEDULED
The oozie:launcher container is allocated and acquired.
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1420073214126_0001 CONTAINERID=container_1420073214126_0001_01_000001
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1420073214126_0001_01_000001 of capacity <memory:1024, vCores:1> on host node:8041, which has 1 containers, <memory:1024, vCores:1> used and <memory:23552, vCores:11> available after allocation
2015-01-01 00:47:54,516 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : ascn07.idc1.level3.com:8041 for container : container_1420073214126_0001_01_000001
2015-01-01 00:47:54,520 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
The oozie:launcher application is allocated, launched, and starts running.
2015-01-01 00:47:54,559 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SCHEDULED to ALLOCATED_SAVING
2015-01-01 00:47:54,568 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED_SAVING to ALLOCATED
2015-01-01 00:47:54,575 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1420073214126_0001_000001
<snip>
2015-01-01 00:47:54,834 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED to LAUNCHED
2015-01-01 00:47:55,094 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING
2015-01-01 00:47:59,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1420073214126_0001 APPATTEMPTID=appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from LAUNCHED to RUNNING
Then the next job begins, which is the Hive job. Its application attempt transitions from NEW to SCHEDULED, but a new container is never created or allocated.
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 2 submitted by user admin
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1420073214126_0002
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW to NEW_SAVING
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW_SAVING to SUBMITTED
2015-01-01 00:48:14,120 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user admin
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0002 from user: admin, in queue: default, currently num of applications: 2
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from SUBMITTED to ACCEPTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0002_000001
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from NEW to SUBMITTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0002_000001 to scheduler from user: admin
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from SUBMITTED to SCHEDULED
At this point the job never progresses. In Cloudera Manager (YARN > Applications) it has a status of "Pending"; in the ResourceManager UI it has a state of "ACCEPTED" but never transitions to "RUNNING".
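The difference between the two attempts is easier to see if the `RMAppAttemptImpl` state changes are pulled out of the ResourceManager log. Here is a rough Python sketch that does this for a few of the lines quoted above. The regular expression and function names are my own, and it assumes the log format matches these lines exactly.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "<timestamp> INFO ...RMAppAttemptImpl: appattempt_X State change from A to B"
TRANSITION = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO \S+RMAppAttemptImpl: "
    r"(?P<attempt>appattempt_\d+_\d+_\d+) State change from (?P<old>\w+) to (?P<new>\w+)"
)

PREFIX = "org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl"

# A subset of the log lines quoted in this post.
SAMPLE = [
    f"2015-01-01 00:47:48,492 INFO {PREFIX}: appattempt_1420073214126_0001_000001 State change from SUBMITTED to SCHEDULED",
    f"2015-01-01 00:47:54,834 INFO {PREFIX}: appattempt_1420073214126_0001_000001 State change from ALLOCATED to LAUNCHED",
    f"2015-01-01 00:47:59,725 INFO {PREFIX}: appattempt_1420073214126_0001_000001 State change from LAUNCHED to RUNNING",
    f"2015-01-01 00:48:14,121 INFO {PREFIX}: appattempt_1420073214126_0002_000001 State change from NEW to SUBMITTED",
    f"2015-01-01 00:48:14,121 INFO {PREFIX}: appattempt_1420073214126_0002_000001 State change from SUBMITTED to SCHEDULED",
]


def attempt_transitions(lines):
    """Map each app attempt id to its ordered list of (timestamp, old, new)."""
    history = defaultdict(list)
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            history[m["attempt"]].append((m["ts"], m["old"], m["new"]))
    return dict(history)


def last_states(history):
    """Return the most recent state seen for each attempt."""
    return {attempt: steps[-1][2] for attempt, steps in history.items()}


if __name__ == "__main__":
    for attempt, state in last_states(attempt_transitions(SAMPLE)).items():
        print(attempt, state)
```

Run against the full log, an attempt whose last recorded state stays at SCHEDULED is one that was added to the scheduler but never received its ApplicationMaster container, which is what happens to the Hive job here.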
This issue is mentioned in a Cloudera blog post from April 2014 (item #5): http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
The suggested fix of adding a value to "max running apps" has no effect.
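To double-check which running-app limits the Fair Scheduler actually loaded, the allocation file can be inspected directly. The sketch below parses an allocation file and reports `maxRunningApps` per queue, along with the `queueMaxAppsDefault` and `userMaxAppsDefault` defaults. The XML shown is a made-up example, not this cluster's configuration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical allocation file content, for illustration only.
SAMPLE_ALLOCATIONS = """
<allocations>
  <queue name="root">
    <queue name="default">
      <maxRunningApps>4</maxRunningApps>
    </queue>
  </queue>
  <queueMaxAppsDefault>10</queueMaxAppsDefault>
  <userMaxAppsDefault>2</userMaxAppsDefault>
</allocations>
"""


def _to_int(text):
    """Convert element text to int, or None when the element is absent."""
    return int(text) if text is not None else None


def running_app_limits(xml_text):
    """Collect running-app limits from a Fair Scheduler allocation file."""
    root = ET.fromstring(xml_text.strip())
    limits = {
        "queueMaxAppsDefault": _to_int(root.findtext("queueMaxAppsDefault")),
        "userMaxAppsDefault": _to_int(root.findtext("userMaxAppsDefault")),
        "queues": {},
    }

    def walk(elem, prefix):
        # Nested <queue> elements produce dotted names such as root.default.
        for q in elem.findall("queue"):
            name = f"{prefix}.{q.get('name')}" if prefix else q.get("name")
            limits["queues"][name] = _to_int(q.findtext("maxRunningApps"))
            walk(q, name)

    walk(root, "")
    return limits


if __name__ == "__main__":
    print(running_app_limits(SAMPLE_ALLOCATIONS))
```

A value of `None` for a queue means no explicit `maxRunningApps` is set there, so any default in the file would apply instead.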