
YARN apps stuck, won't allocate resources

Explorer

CDH 5.2.0-1.cdh5.2.0.p0.36

 

We had an issue with HDFS filling up, which caused a number of services to fail. After we cleared space and restarted the cluster, we aren't able to run any Hive workflows through Oozie; they seem to get stuck allocating resources.

 

No changes were made to the YARN resource configurations, which seems to be the go-to troubleshooting step. We have plenty of resources allocated to YARN containers, and there are currently no app limits set in the dynamic resource pools.

 

When I start an Oozie workflow, the oozie:launcher application starts normally, but the Hive query it executes is always stuck in the ACCEPTED state and never transitions to RUNNING.

 

The oozie:launcher application is accepted and scheduled.

 

2015-01-01 00:47:48,472 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0001 from user: admin, in queue: default, currently num of applications: 1
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0001 State change from SUBMITTED to ACCEPTED
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0001_000001
2015-01-01 00:47:48,476 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from NEW to SUBMITTED
2015-01-01 00:47:48,490 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0001_000001 to scheduler from user: admin
2015-01-01 00:47:48,492 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SUBMITTED to SCHEDULED

 

oozie:launcher container is allocated and acquired

 

2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1420073214126_0001 CONTAINERID=container_1420073214126_0001_01_000001
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1420073214126_0001_01_000001 of capacity <memory:1024, vCores:1> on host node:8041, which has 1 containers, <memory:1024, vCores:1> used and <memory:23552, vCores:11> available after allocation
2015-01-01 00:47:54,516 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : ascn07.idc1.level3.com:8041 for container : container_1420073214126_0001_01_000001
2015-01-01 00:47:54,520 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED

 

oozie:launcher application is allocated, launched, and starts running

 

2015-01-01 00:47:54,559 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SCHEDULED to ALLOCATED_SAVING
2015-01-01 00:47:54,568 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED_SAVING to ALLOCATED
2015-01-01 00:47:54,575 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1420073214126_0001_000001

<snip>

2015-01-01 00:47:54,834 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED to LAUNCHED

2015-01-01 00:47:55,094 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING
2015-01-01 00:47:59,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1420073214126_0001 APPATTEMPTID=appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from LAUNCHED to RUNNING

 

Then the next job, which is a Hive job, begins. It transitions from NEW to SCHEDULED, but a new container is never created or allocated for it.

 

2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 2 submitted by user admin
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1420073214126_0002
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW to NEW_SAVING
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW_SAVING to SUBMITTED
2015-01-01 00:48:14,120 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user admin
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0002 from user: admin, in queue: default, currently num of applications: 2
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from SUBMITTED to ACCEPTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0002_000001
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from NEW to SUBMITTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0002_000001 to scheduler from user: admin
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from SUBMITTED to SCHEDULED

 

At this point the job never progresses. In Cloudera Manager -> YARN -> Applications it has a status of "Pending"; on the ResourceManager UI it has a state of "ACCEPTED" but never transitions to "RUNNING".

 

This issue is mentioned in a blog post from April (gotcha #5): http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

 

The suggested fix of adding a value to "max running apps" has no effect.
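
For reference, here is roughly how the stuck application and the fair scheduler's queue state can be inspected from the command line (the ResourceManager host below is a placeholder, not our actual host):

# Placeholder: replace with the actual ResourceManager address.
RM_HOST=resourcemanager.example.com

# List applications that are sitting in ACCEPTED and never reach RUNNING.
yarn application -list -appStates ACCEPTED

# Dump the fair scheduler state (queue capacities, running apps, per-user usage)
# from the ResourceManager REST API.
curl -s "http://${RM_HOST}:8088/ws/v1/cluster/scheduler"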

 

1 ACCEPTED SOLUTION

Explorer

User error.

 

Everything was fine with the resource pools, but there was a default user limit set.


8 REPLIES

Explorer

User error.

 

Everything was fine with the resource pools, but there was a default user limit set.

New Contributor

I'm having the same issue, all jobs get stuck in ACCEPTED. This is a new install. I'm trying to run a simple Hive query (select count(*) from table).

 

Can you tell me what the solution was?

Explorer

In our case I had accidentally set a default "user limit" of 1 for "max running apps per user". All of our jobs required more than one application to run at a time per user.

 

This is configured in Clusters -> Dynamic Resource Pools -> Configuration -> User Limits -> Default Settings

 

It could also be that your jobs are waiting for resources to become available before starting. Perhaps you have too few resources available for what is being requested?
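
For anyone managing the allocation file by hand instead of through Cloudera Manager, the equivalent setting lives in fair-scheduler.xml. The snippet below is only an illustrative sketch (the queue name and numbers are made up); on CDH the file is generated from the Dynamic Resource Pools configuration, so change it there instead:

# Illustrative only -- do not hand-edit this on a CM-managed cluster.
cat <<'EOF' > /tmp/fair-scheduler-example.xml
<?xml version="1.0"?>
<allocations>
  <queue name="default">
    <maxRunningApps>50</maxRunningApps>  <!-- per-queue cap -->
  </queue>
  <!-- A default per-user cap of 1 reproduces the symptom in this thread:
       the oozie:launcher app takes the single slot, so the Hive job stays ACCEPTED. -->
  <userMaxAppsDefault>1</userMaxAppsDefault>
</allocations>
EOF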

New Contributor
I had no user limits set. Do I need to set them, or leave them blank?

What do you consider too few resources? I have 1 master server with 300GB disk and 32GB of mem, and 3 slaves with 3TB disk and 32GB mem. This is a brand new install with only 1 user. 99% CPU idle, 27GB mem free.
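
For what it's worth, the memory and vcores YARN is actually advertising (which can be much less than the raw hardware) can be checked with something like the following; the ResourceManager host is a placeholder:

# Placeholder: replace with the actual ResourceManager address.
RM_HOST=resourcemanager.example.com

# Every NodeManager with its state (RUNNING/UNHEALTHY/LOST) and running containers.
yarn node -list -all

# Cluster totals: total/allocated/available memory and vcores, plus pending apps.
curl -s "http://${RM_HOST}:8088/ws/v1/cluster/metrics"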



Super Collaborator

Check this part of the documentation on YARN tuning; it explains it all. You might have a default value set that you have overlooked, causing the issue.

 

Wilfred

New Contributor

Hi, were you able to figure out the solution? I am stuck in the same situation.

New Contributor

Hi,

 

 

Can you please check the NodeManager logs? If the logs show a message like "DiskSpace reached the threshold value", the problem is disk space on your cluster.

The NodeManagers are running fine, but they have already reached the threshold set by this parameter:

yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage = 90.0% (default), and usage has gone beyond 90% per disk.

This puts the NodeManagers into an unhealthy status, and if the NodeManagers are unhealthy the ResourceManager won't allocate resources to run your applications.

You can increase the value to something larger, such as 95%.

The best solution is to add a few more disks with enough space for both the HDFS DataNodes and the YARN NodeManagers.
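
A rough way to confirm this is the cause, and what the override looks like, is sketched below; the local-dirs path is an assumption, so substitute whatever yarn.nodemanager.local-dirs points to on your NodeManagers:

# UNHEALTHY nodes here are skipped by the ResourceManager when allocating containers.
yarn node -list -all

# On a NodeManager host, check usage of the disk(s) backing the YARN local dirs.
# The path below is an assumption; use your actual yarn.nodemanager.local-dirs value.
df -h /yarn/nm

# To buy time, the threshold can be raised in yarn-site.xml (in CDH, via the
# NodeManager configuration/safety valve) -- illustrative value only:
cat <<'EOF'
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
EOF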

Explorer

Hi Guys,

I am facing a similar issue. I have a new installation of Cloudera, and I am trying to run a simple MapReduce Pi example and also a Spark job. The MapReduce job gets stuck at the map 0% reduce 0% step as shown below, and the Spark job spends a lot of time in the ACCEPTED state. I checked the user limit and it is blank for me.

 

[test@spark-1 ~]$ sudo -u hdfs hadoop jar /data/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
18/10/16 12:33:25 INFO input.FileInputFormat: Total input paths to process : 10
18/10/16 12:33:26 INFO mapreduce.JobSubmitter: number of splits:10
18/10/16 12:33:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1539705370715_0002
18/10/16 12:33:26 INFO impl.YarnClientImpl: Submitted application application_1539705370715_0002
18/10/16 12:33:26 INFO mapreduce.Job: The url to track the job: http://spark-4:8088/proxy/application_1539705370715_0002/
18/10/16 12:33:26 INFO mapreduce.Job: Running job: job_1539705370715_0002
18/10/16 12:33:31 INFO mapreduce.Job: Job job_1539705370715_0002 running in uber mode : false
18/10/16 12:33:31 INFO mapreduce.Job:  map 0% reduce 0%

I made multiple config changes, but cannot find a solution for this. The only error I could trace was in the NodeManager log file, as below:

ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM

I tried checking various properties discussed in this thread, but I still have the issue. Can someone please help in solving it? Please let me know what details I can provide.
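
In case it helps whoever picks this up, the quickest things to capture and post are probably the application diagnostics and the NodeManager states, along the lines of:

# Diagnostics for the stuck job (application ID taken from the output above).
yarn application -status application_1539705370715_0002

# NodeManager states -- an UNHEALTHY or LOST node usually explains a job that is
# accepted but never gets containers.
yarn node -list -all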