
Spark jobs are stuck under YARN Fair Scheduler

Contributor

Hi,

I have set up the YARN Fair Scheduler in Ambari (HDP 3.1.0.0-78) for the "default" queue itself. So far, I haven't added any new queues.

 

Now, when I submit a simple job against the queue, the application stays in the "ACCEPTED" state forever. I get the message below in the YARN logs:

 

Additional information is given below. Please help me fix this issue.

 

YARN_AM_Message.PNG

For "default" queue, the below parameters are set through "fair-scheduler.xml".  

 

fair_scheduler_screenshot_2.PNG
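
In text form, the relevant entries in fair-scheduler.xml look roughly like this (a sketch; the exact values are the ones in the screenshot):

<?xml version="1.0"?>
<allocations>
  <queue name="default">
    <!-- Minimum resources guaranteed to the queue: memory in MB, then vcores -->
    <minResources>1024 mb, 0 vcores</minResources>
  </queue>
  <!-- Default cap for queues that don't set maxResources themselves -->
  <queueMaxResourcesDefault>8192 mb, 0 vcores</queueMaxResourcesDefault>
</allocations>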

Also, no jobs are currently running apart from the one I have launched.

 

yarn_job_status.PNG

The screenshot below confirms that the maximum AM resource percent is greater than 0.1.

Scheduler_AM_Percent.PNG

 

12 REPLIES

Contributor

Hi Sudhnidra,

Please take a look at:
https://blog.cloudera.com/yarn-fairscheduler-preemption-deep-dive/
https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-3-scheduler-concepts/
https://clouderatemp.wpengine.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-q...

 

What type of FairShare are you looking at:

Steady FairShare
Instantaneous FairShare

 

What is the weight of the default queue you are submitting your apps to?
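
For reference, the weight is a per-queue element in fair-scheduler.xml. A sketch with an illustrative value:

<allocations>
  <queue name="default">
    <!-- Relative weight used to compute this queue's fair share (default is 1.0) -->
    <weight>1.0</weight>
  </queue>
</allocations>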

 

Best,
Lyubomir

 



Contributor

Hi @ssk26,

From my perspective, you are limiting your default queue to a minimum of 1024 MB / 0 vCores and a maximum of 8192 MB / 0 vCores. In both cases no cores are set, so when you run a job it requests 1024 MB of memory and 1 vCore; it then fails to allocate the 1 vCore because of the 0 vCore min/max restriction and reports 'exceeds maximum AM resources allowed'.

 

That's why I think the issue is with the vCore allocation and not with the memory.
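
If so, a queue definition along these lines should let the AM container get its 1 vCore (a sketch with illustrative values; the key change is the non-zero vcores):

<allocations>
  <queue name="default">
    <!-- Non-zero vcores so the 1024 MB / 1 vCore AM container can be allocated -->
    <minResources>1024 mb, 1 vcores</minResources>
    <maxResources>8192 mb, 8 vcores</maxResources>
  </queue>
</allocations>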

HTH

Best,
Lyubomir

Contributor

Hello,

In your screenshot, <queueMaxResourcesDefault> is set to 8192 mb, 0 vcores, while your job requires at least 1 vcore, as seen in the Diagnostics section.

Please try increasing the vcores in <queueMaxResourcesDefault> and run the job again.
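
For example (illustrative vcore count; adjust to your cluster):

<allocations>
  <!-- Default resource cap applied to queues that don't set maxResources themselves -->
  <queueMaxResourcesDefault>8192 mb, 8 vcores</queueMaxResourcesDefault>
</allocations>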

 

Best,
Lyubomir