Spark jobs are stuck under YARN Fair Scheduler

Contributor

Hi,

I have set up the YARN Fair Scheduler in Ambari (HDP 3.1.0.0-78) with only the "default" queue; so far, I haven't added any new queues.

Now, when I submit a simple job to the queue, the application stays in the "ACCEPTED" state forever, and I get the message shown below in the YARN logs.

Additional information is attached. Please help me fix this issue as soon as possible.

YARN_AM_Message.PNG

For "default" queue, the below parameters are set through "fair-scheduler.xml".  

 

fair_scheduler_screenshot_2.PNG
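
In plain text, the screenshot amounts to roughly this allocation file (reconstructed from the values quoted in the replies below; the exact layout of my file may differ):

<?xml version="1.0"?>
<allocations>
  <!-- default cap applied to queues without an explicit maxResources -->
  <queueMaxResourcesDefault>8192 mb,0vcores</queueMaxResourcesDefault>
  <queue name="default">
    <!-- minimum guaranteed resources for the queue -->
    <minResources>1024 mb,0vcores</minResources>
  </queue>
</allocations>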

Also, no jobs are currently running apart from the one I have launched.

yarn_job_status.PNG

The screenshot below confirms that the maximum AM resource percent is greater than 0.1.

Scheduler_AM_Percent.PNG
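
For context, that percentage corresponds to the Fair Scheduler's per-queue <maxAMShare> element (default 0.5; -1.0 disables the check), which can also be set globally via <queueMaxAMShareDefault>. A minimal sketch, not my actual file:

<allocations>
  <!-- global default for the fraction of fair share that AMs may consume -->
  <queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>
  <queue name="default">
    <!-- per-queue override -->
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>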


12 Replies

Contributor

Hi Sudhnidra,

Please take a look at:
https://blog.cloudera.com/yarn-fairscheduler-preemption-deep-dive/
https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-3-scheduler-concepts/
https://clouderatemp.wpengine.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-q...


Which fair share are you looking at:

Steady FairShare
Instantaneous FairShare


What is the weight of the default queue you are submitting your apps to?


Best,
Lyubomir




Contributor

Hi @ssk26 

From my perspective, you are limiting your default queue to a minimum of 1024 MB / 0 vCores and a maximum of 8192 MB / 0 vCores. In both cases no cores are set, so when you try to run a job that needs 1024 MB of memory and 1 vCore, it fails to allocate the 1 vCore due to the 0-vCore min/max restriction and reports 'exceeds maximum AM resources allowed'.

That's why I think the issue is with core allocation and not with memory.
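
As a sketch of the fix, assuming an 8-vCore cap is acceptable for your cluster (size it to your capacity), the queue needs non-zero vcores:

<queue name="default">
  <!-- give the queue at least one core so the AM container can be allocated -->
  <minResources>1024 mb,1vcores</minResources>
  <maxResources>8192 mb,8vcores</maxResources>
</queue>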

HTH

Best,
Lyubomir

Contributor

Hello,

In your screenshot, <queueMaxResourcesDefault> is set to 8192 mb, 0vcores.


Your job requires at least 1 vcore, as shown in the Diagnostics section.

Please try increasing the vcores value in <queueMaxResourcesDefault> and run the job again.
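
For example, something along these lines in fair-scheduler.xml (the 8 vcores figure is only an illustration, adjust it to your cluster):

<!-- default maximum for queues that do not set their own maxResources -->
<queueMaxResourcesDefault>8192 mb,8vcores</queueMaxResourcesDefault>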


Best,
Lyubomir