Created on 12-31-2014 06:03 PM - edited 09-16-2022 02:17 AM
CDH 5.2.0-1.cdh5.2.0.p0.36
We had an issue with HDFS filling up, which caused a number of services to fail. After we cleared space and restarted the cluster, we can no longer run any Hive workflows through Oozie. They seem to get stuck allocating resources.
No changes were made to the YARN resource configuration, which seems to be the go-to for troubleshooting. We have plenty of resources allocated to YARN containers, and there are currently no app limits set in the dynamic resource pools.
When I start an Oozie workflow, the oozie:launcher application starts normally, but the Hive query it executes is always stuck in the ACCEPTED state and never transitions to RUNNING.
The oozie:launcher application is accepted and scheduled.
2015-01-01 00:47:48,472 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0001 from user: admin, in queue: default, currently num of applications: 1
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0001 State change from SUBMITTED to ACCEPTED
2015-01-01 00:47:48,475 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0001_000001
2015-01-01 00:47:48,476 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from NEW to SUBMITTED
2015-01-01 00:47:48,490 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0001_000001 to scheduler from user: admin
2015-01-01 00:47:48,492 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SUBMITTED to SCHEDULED
oozie:launcher container is allocated and acquired
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1420073214126_0001 CONTAINERID=container_1420073214126_0001_01_000001
2015-01-01 00:47:54,514 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1420073214126_0001_01_000001 of capacity <memory:1024, vCores:1> on host node:8041, which has 1 containers, <memory:1024, vCores:1> used and <memory:23552, vCores:11> available after allocation
2015-01-01 00:47:54,516 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : ascn07.idc1.level3.com:8041 for container : container_1420073214126_0001_01_000001
2015-01-01 00:47:54,520 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
oozie:launcher application is allocated, launched, and starts running
2015-01-01 00:47:54,559 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SCHEDULED to ALLOCATED_SAVING
2015-01-01 00:47:54,568 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED_SAVING to ALLOCATED
2015-01-01 00:47:54,575 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1420073214126_0001_000001
<snip>
2015-01-01 00:47:54,834 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from ALLOCATED to LAUNCHED
2015-01-01 00:47:55,094 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1420073214126_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING
2015-01-01 00:47:59,724 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1420073214126_0001 APPATTEMPTID=appattempt_1420073214126_0001_000001
2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from LAUNCHED to RUNNING
Then the next job begins, which is a Hive job. Its application attempt transitions from NEW to SCHEDULED, but a new container is never created or allocated.
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 2 submitted by user admin
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1420073214126_0002
2015-01-01 00:48:14,119 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=admin IP=1.1.1.1 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW to NEW_SAVING
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1420073214126_0002
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from NEW_SAVING to SUBMITTED
2015-01-01 00:48:14,120 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user admin
2015-01-01 00:48:14,120 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1420073214126_0002 from user: admin, in queue: default, currently num of applications: 2
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1420073214126_0002 State change from SUBMITTED to ACCEPTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1420073214126_0002_000001
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from NEW to SUBMITTED
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1420073214126_0002_000001 to scheduler from user: admin
2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from SUBMITTED to SCHEDULED
At this point the job never progresses. Under CM -> YARN Applications it has a status of "Pending"; in the ResourceManager UI it has a state of "ACCEPTED" and never transitions to "RUNNING".
This issue is mentioned in a blog post from April 2014 (item #5): http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
The suggested fix of setting a value for "max running apps" has no effect.
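For anyone who wants to spot this pattern in their own ResourceManager log without reading it line by line, here is a minimal Python sketch. It assumes the log uses the same "State change from X to Y" wording shown above; the function names are made up for illustration. It reports every application attempt whose last logged state is SCHEDULED, meaning it is still waiting for an ApplicationMaster container.

```python
import re

# Matches ResourceManager lines such as:
#   "... RMAppAttemptImpl: appattempt_X State change from SUBMITTED to SCHEDULED"
STATE_CHANGE = re.compile(
    r"RMAppAttemptImpl: (appattempt_\d+_\d+_\d+) State change from (\w+) to (\w+)"
)


def last_attempt_states(log_lines):
    """Return {attempt_id: last state seen} for every attempt in the log."""
    states = {}
    for line in log_lines:
        match = STATE_CHANGE.search(line)
        if match:
            attempt_id, _old_state, new_state = match.groups()
            states[attempt_id] = new_state
    return states


def stuck_attempts(log_lines):
    """Attempts whose last logged state is SCHEDULED (no AM container yet)."""
    states = last_attempt_states(log_lines)
    return sorted(a for a, s in states.items() if s == "SCHEDULED")


# Three lines taken from the log excerpts in this post.
sample = [
    "2015-01-01 00:47:48,492 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from SUBMITTED to SCHEDULED",
    "2015-01-01 00:47:59,725 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0001_000001 State change from LAUNCHED to RUNNING",
    "2015-01-01 00:48:14,121 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1420073214126_0002_000001 State change from SUBMITTED to SCHEDULED",
]

print(stuck_attempts(sample))
```

On a real cluster you would pass the lines of the ResourceManager log file instead of the sample list. An attempt that was only just submitted will also show up here, so it is worth re-running after a minute or two.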
Created 01-09-2015 02:05 PM
User error.
Everything was fine with the resource pools, but there was a default user limit set.
Created 07-07-2015 03:32 PM
I'm having the same issue: all jobs get stuck in ACCEPTED. This is a new install. I'm trying to run a simple Hive query (select count(*) from table).
Can you tell me what the solution was?
Created 07-07-2015 03:41 PM
In our case I had accidentally set the default "user limits" value for "max running apps per user" to 1. All of our jobs required more than one application to run at a time per user.
This is configured in Clusters -> Dynamic resource pools -> Configuration -> User limits -> Default settings
It could also be that your jobs are attempting to wait for resources to become available before starting. Perhaps you have too few resources available for what is being requested?
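To illustrate why a per-user limit of 1 blocks an Oozie Hive action, here is a small Python sketch. It assumes the dynamic resource pool settings end up in a Fair Scheduler allocation file that uses the standard `userMaxAppsDefault` and per-user `maxRunningApps` elements; the sample XML and the function names are hypothetical, not copied from our cluster. The Oozie launcher and the Hive job it starts are two applications from the same user, so the second one can never start while the launcher is running.

```python
import xml.etree.ElementTree as ET

# Hypothetical allocation file: a default limit of 1 running app per user.
SAMPLE_ALLOCATIONS = """
<allocations>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
  <userMaxAppsDefault>1</userMaxAppsDefault>
</allocations>
"""


def user_app_limits(allocation_xml):
    """Return (default_limit_or_None, {user: limit}) from an allocation file."""
    root = ET.fromstring(allocation_xml)
    default_text = root.findtext("userMaxAppsDefault")
    default = int(default_text) if default_text is not None else None
    per_user = {}
    for user in root.findall("user"):
        limit_text = user.findtext("maxRunningApps")
        if limit_text is not None:
            per_user[user.get("name")] = int(limit_text)
    return default, per_user


def can_run_concurrently(user, apps_needed, allocation_xml):
    """True if the user's limit allows apps_needed applications at once."""
    default, per_user = user_app_limits(allocation_xml)
    limit = per_user.get(user, default)
    return limit is None or apps_needed <= limit


# An Oozie Hive action needs the launcher plus the Hive job: 2 applications.
print(can_run_concurrently("admin", 2, SAMPLE_ALLOCATIONS))
```

With the sample file this prints False for two applications and would print True for one, which matches what we saw: the launcher runs and the Hive job waits forever.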
Created 07-07-2015 04:17 PM
Created 07-08-2015 10:28 PM
Check the YARN tuning section of the documentation; it explains all of this. You might have overlooked a default value that is causing the issue.
Wilfred
Created 09-07-2018 06:23 AM
Hi, were you able to figure out the solution? I am stuck in the same situation.
Created 09-04-2015 04:21 AM
Hi,
Can you please check the NodeManager logs? See whether they show the following message: DiskSpace reached the threshold value.
This is caused by low disk space on your cluster.
The NodeManagers are running, but disk usage has already exceeded the threshold set by the following parameter:
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage = 90.0 (default), and usage is above 90% per disk.
This puts the NodeManagers into unhealthy status. While NodeManagers are unhealthy, the ResourceManager won't allocate resources on them to run your applications.
As a workaround, you can increase the value, for example to 95%.
The best solution is to add a few more disks with enough space to both the HDFS DataNodes and the YARN NodeManagers.
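A quick way to see whether a disk is past that threshold is a small Python check like the one below. This is only a rough sketch: it approximates utilization as used/total from `shutil.disk_usage`, which may not match exactly how the NodeManager computes it, and the directory list is a placeholder that you should replace with your own yarn.nodemanager.local-dirs and log-dirs values.

```python
import shutil

DEFAULT_THRESHOLD = 90.0  # default max-disk-utilization-per-disk-percentage


def utilization_percent(used_bytes, total_bytes):
    """Disk utilization as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def is_over_threshold(used_bytes, total_bytes, threshold=DEFAULT_THRESHOLD):
    """True if utilization is above the configured threshold."""
    return utilization_percent(used_bytes, total_bytes) > threshold


def check_dirs(paths, threshold=DEFAULT_THRESHOLD):
    """Return {path: (percent_used, over_threshold)} for each directory."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        percent = utilization_percent(usage.used, usage.total)
        report[path] = (round(percent, 1), percent > threshold)
    return report


# Placeholder: replace "." with the NodeManager local and log directories.
for path, (percent, over) in check_dirs(["."]).items():
    status = "OVER threshold" if over else "ok"
    print(f"{path}: {percent}% used ({status})")
```

Run it on each NodeManager host. Any directory reported as over the threshold is a candidate for the unhealthy status described above.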
Created 10-16-2018 09:49 AM
Hi Guys,
I am facing a similar issue. I have a new installation of Cloudera and I am trying to run a simple MapReduce Pi example and also a Spark job. The MapReduce job gets stuck at map 0% reduce 0%, as shown below, and the Spark job spends a lot of time in the ACCEPTED state. I checked the user limit and it is blank for me.
[test@spark-1 ~]$ sudo -u hdfs hadoop jar /data/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
18/10/16 12:33:25 INFO input.FileInputFormat: Total input paths to process : 10
18/10/16 12:33:26 INFO mapreduce.JobSubmitter: number of splits:10
18/10/16 12:33:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1539705370715_0002
18/10/16 12:33:26 INFO impl.YarnClientImpl: Submitted application application_1539705370715_0002
18/10/16 12:33:26 INFO mapreduce.Job: The url to track the job: http://spark-4:8088/proxy/application_1539705370715_0002/
18/10/16 12:33:26 INFO mapreduce.Job: Running job: job_1539705370715_0002
18/10/16 12:33:31 INFO mapreduce.Job: Job job_1539705370715_0002 running in uber mode : false
18/10/16 12:33:31 INFO mapreduce.Job: map 0% reduce 0%
I made multiple config changes but cannot find a solution. The only error I could trace was in the NodeManager log file:
ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
I checked the various properties discussed in this thread, but I still have the issue. Can someone please help me solve it? Please let me know what details I can provide.