Jobs on Queues with node labels stay in "ACCEPTED" state for long time

My client is seeing an odd problem on their production cluster, which has node labels enabled. Users sometimes complain that their jobs sit in the “ACCEPTED” state for a long time (25 minutes in some cases), then start and finish within a minute. We have only seen this with jobs submitted to the queue that has node labels enabled on it; jobs submitted to other queues are not affected.

The node label covers 15 nodes and is non-exclusive. All the queues use the “fair” ordering policy.

The queue is not heavily utilized; sometimes it takes as few as 2 jobs for them to get stuck in the ACCEPTED state.

Any thoughts on what could be causing this?

Re: Jobs on Queues with node labels stay in "ACCEPTED" state for long time

@Eyad Garelnabi

Can you provide more details about the environment?

  • What version of HDP are you using?
  • For the queue with node labels, what is the minimum user limit and user limit factor?
  • Do you have preemption enabled on any of the queues?
  • For the jobs that seem to "hang", have you found certain types of jobs exhibit the behavior more than others?
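
In case it helps with gathering the queue settings, here is a minimal sketch in Python that pulls every Capacity Scheduler property for one queue out of the contents of capacity-scheduler.xml. The queue path `root.labeled`, the label `x`, and all values in the sample are made up for illustration, so substitute your own. Preemption is normally switched on in yarn-site.xml through `yarn.resourcemanager.scheduler.monitor.enable`, so that one needs a separate look.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of capacity-scheduler.xml. The queue "root.labeled",
# the label "x", and every value below are illustrative assumptions.
SAMPLE_XML = """<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.labeled.minimum-user-limit-percent</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.labeled.user-limit-factor</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.labeled.ordering-policy</name>
    <value>fair</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.labeled.accessible-node-labels</name>
    <value>x</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>2</value>
  </property>
</configuration>"""

PREFIX = "yarn.scheduler.capacity."


def queue_settings(xml_text, queue_path):
    """Return {short property name: value} for one queue path.

    Properties of child queues (if any) also match the prefix and show up
    with the child name at the front of the key.
    """
    root = ET.fromstring(xml_text)
    prefix = PREFIX + queue_path + "."
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name.startswith(prefix):
            settings[name[len(prefix):]] = value
    return settings


if __name__ == "__main__":
    for key, value in sorted(queue_settings(SAMPLE_XML, "root.labeled").items()):
        print(f"{key} = {value}")
```

To run it against a real cluster, read the file contents into a string and pass that in place of `SAMPLE_XML`, then paste the output for the labeled queue here.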

You may be running into one of these issues:

https://issues.apache.org/jira/browse/YARN-3215

https://issues.apache.org/jira/browse/YARN-4140