
Yarn queue limits don't apply

Explorer

Hi,

I'm stuck on a problem and it would be really great if someone could help me!

I'm running an HDP 2.5.0.0 cluster with the Capacity Scheduler. Let's say I have 4 queues - Q1, Q2, Q3 and Q4 - defined under root. Q1, Q2 and Q3 are leaf queues with minimum and maximum capacities of 20% and 40% respectively (the three queues are identical). Q4 is a parent queue (minimum capacity 40%, maximum 100%) and has 4 leaf queues under it - Q41, Q42, Q43 and Q44 (minimum 25%, maximum 100% for all 4 sub-queues).

All queues have minimum user limit set to 100% and user limit factor set to 1.
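For reference, a minimal capacity-scheduler.xml sketch of the setup described above (the property names are the standard Capacity Scheduler ones; the values are taken from the description, everything else assumed default - this is not the actual cluster config):

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>Q1,Q2,Q3,Q4</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q1.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q1.maximum-capacity</name>
  <value>40</value>
</property>
<!-- Q2 and Q3 configured the same way as Q1. -->
<property>
  <name>yarn.scheduler.capacity.root.Q4.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q4.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q4.queues</name>
  <value>Q41,Q42,Q43,Q44</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q4.Q41.capacity</name>
  <value>25</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.Q4.Q41.maximum-capacity</name>
  <value>100</value>
</property>
<!-- Q42-Q44 configured like Q41; every queue additionally has
     minimum-user-limit-percent = 100 and user-limit-factor = 1. -->
```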

Issue :

When users submit jobs to Q1, Q2 and Q41 while the other queues are empty, I would expect Q1 and Q2 to be at 20%+ absolute capacity and Q4 at 40%+, i.e. roughly 25% (Q1), 25% (Q2) and 50% (Q41).

But this is not happening.

Q1 and Q2 always stay at 40%, and Q41 (and thus Q4) gets only 10% absolute capacity.

Any idea why this is happening?

Thanks.

1 ACCEPTED SOLUTION

Expert Contributor

@Priyan S I tried to reproduce your issue, but couldn't (based on the given capacity configuration). I set all the properties based on your description and left everything else on default. With this setup the scheduler works as you expected.

After submitting the first two jobs, they get the maximum available resource per queue (40% each):

[screenshot 11793-1-q1q2.png: Q1 and Q2 each at 40%]

Then, after submitting the job to Q41, it starts getting resources (first just the AM container, and then - slowly - more and more containers):

[screenshot 11794-2-q1q2q41.png: Q41 slowly getting containers]

Finally - when preemption is enabled - the apps in Q1 and Q2 start losing resources and Q41 gets even more, and the final distribution will be like 25% - 25% - 50%, as you expected:

[screenshot 11795-3-balanced.png: final 25% / 25% / 50% distribution]

This means that the problem is most likely not caused by the parameters you've provided in your question. I'd suggest checking other parameters (you can get some of them easily from the ResourceManager Scheduler UI). The most suspicious thing is that the apps in Q41 get so few resources, so I'd start there. Here are a few things that I can think of:

  • Max Application Master Resources
    • If you submit many small applications to Q41 (and big applications to Q1 and Q2), this might cause them to hang
    • also: Max Application Master Resources Per User
  • User Limit Factor
    • The other way around: if you submit just one giant application to Q41
  • Preemption parameters (mentioned in the previous comments)
    • Maybe preemption is not set up properly - although I also tried the same setup with preemption disabled, and it worked exactly the same way, except the balancing phase took longer
    • total_preemption_per_round
    • max_wait_before_kill
    • etc.
  • Node labels
    • You didn't mention in your question if you have node labels configured in your cluster. They can change the scheduling quite drastically.

If no luck, you could also check the logs of the capacity scheduler for further information.
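For reference, the preemption knobs mentioned above live in yarn-site.xml. A sketch with their usual defaults - verify the actual values against your own cluster:

```xml
<!-- Enable the preemption monitor for the Capacity Scheduler. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<!-- Fraction of total resources that may be preempted in one round. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
  <value>0.1</value>
</property>
<!-- Grace period (ms) between asking an app to release a container and killing it. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>15000</value>
</property>
<!-- How often (ms) the preemption policy runs. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval</name>
  <value>3000</value>
</property>
```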


10 REPLIES

Super Collaborator

@Priyan S

Maybe it is because you did not set up pre-emption on Yarn?

Without pre-emption, the order in which the jobs were submitted to Q1, Q2 and Q41 determines the capacity allocations. It may be that since the jobs on Q1 and Q2 were submitted first, they both grabbed the maximum allocation of their respective queues (40%). When the job for Q41 comes along, there is no more than the remaining 20% left for Q4 (and, within that, for any of its sub-leafs Q41-Q44, each of which may use up to 100% of Q4). I don't get why Q41 is only getting 10% and not 20%, though.

You can look at pre-emption as a way to restore the state in which all queues get at least their minimum configured allocation. The missing part for a queue A operating under its minimum might be in use by another queue B operating above its minimum (B grabbed the excess capacity, up to its maximum, because it was still available at the time).

Without pre-emption, queue A would have to wait for queue B to release the capacity of finished jobs in B.

With pre-emption, YARN will actively free up resources of B to allocate to A; in the process it might even kill job parts in queue B to do so.

Explorer

Thanks @Jasper for your reply. But pre-emption is enabled - I can confirm that because YARN jobs spawned under those queues say "Pre-emption enabled" in the ResourceManager.

"I don't get why Q41 is only getting 10% and not 20%."

^ Actually I was talking about the absolute capacity, which is calculated as 25% × 40% = 10% absolute. So the minimum is satisfied, and the excess resources are then moved to the queues one level above (Q1, Q2 & Q3).
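To spell out that arithmetic (a small illustrative script, not any YARN API): a leaf queue's absolute capacity is the product of the configured capacities down the hierarchy.

```python
def absolute_capacity(percentages):
    """Chain configured capacities: each level is a percentage of its parent.
    E.g. Q41 is 25% of Q4, and Q4 is 40% of root."""
    abs_cap = 100.0
    for pct in percentages:
        abs_cap = abs_cap * pct / 100.0
    return abs_cap

print(absolute_capacity([40, 25]))  # Q41: 25% of 40% -> 10.0 (% absolute)
print(absolute_capacity([20]))      # Q1: a root-level leaf -> 20.0
```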

So it seems to me like queues at a certain level have higher priority than their underlying sub-leafs. Meaning: if the minimum capacity is satisfied for the sub-leafs, the ResourceManager puts their parent on a wait list and allocates more resources to the other queues at the parent's level.

This is what I observed - it doesn't make sense though!

Expert Contributor

I think pre-emption works within the leaf queues under the same parent queue. That is why this behavior is observed.

Explorer

This could partially explain it - thanks for the spark. But I would still expect that, in a FIFO queue, resources are handed out round-robin according to demand. Then there should also be a more civilized/balanced distribution of resources across same-level queues, and thereby the sub-leafs would get a fair portion. Confusing! 😞

Explorer

http://hortonworks.com/blog/better-slas-via-resource-preemption-in-yarns-capacityscheduler/

This doc says:

"preemption works in conjunction with the scheduling flow to make sure that resources freed up at any level in the hierarchy are given back to the right queues in the right level".

Explorer

@gnovak Perfect illustration - this kind of doc is not available on the internet; I wish Hortonworks would pin it somewhere 🙂

In your case, was the user limit factor set to 1?

I also suspect the apps themselves as a reason why they were not requesting more capacity. In my case the workload was different: Q1 and Q2 had 1 app each with few containers but a large amount of resources per container, while Q41 had one app with many containers but minimal resources (containers with the minimum configured memory and vcores in YARN). Anyway, I'll investigate further by pushing the same load to all queues simultaneously and see.

Thank you for your time, much appreciated 🙂 !

Expert Contributor

Thanks for your kind words 🙂

>>> In your case, was the user limit factor set to 1 ?

At first I set the user limit factor to 1 for each queue, as you described in your question. But that way I had to submit multiple applications from multiple users to each queue in order to use the maximum resources (user limit factor = 1 means that a single user can only use the configured capacity - for example, in the case of Q1, 20%), so I raised it later to make testing easier (just one application per queue).

Explorer

@gnovak I completely figured out the issue - not sure if I can even call it an issue!

It was the "user-limit-factor".

In my case, each queue is used by only one user. My assumption was that if the minimum capacity of a sub-leaf (Q41) is 25% and it can grow up to 100% of its parent queue Q4, then the maximum useful user-limit-factor for Q41 would be 4 (4 * 25 = 100%). But this is not true! It can grow beyond that - up to the absolute max configured capacity! So the math is: max(user-limit-factor) = absolute max configured capacity / absolute configured capacity.

The absolute values can be found in the Scheduler section of the ResourceManager UI.
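Worked through for Q41 (illustrative math only, using the formula above): the absolute configured capacity is 25% of 40% = 10%, and the absolute max capacity is 100%, so the largest user-limit-factor a single user can exploit is 100 / 10 = 10, not 4.

```python
def max_useful_user_limit_factor(abs_max_capacity_pct, abs_capacity_pct):
    """Largest user-limit-factor a single user can actually exploit:
    absolute max configured capacity / absolute configured capacity."""
    return abs_max_capacity_pct / abs_capacity_pct

# Q41: absolute capacity = 25% of 40% = 10%; absolute max = 100%.
print(max_useful_user_limit_factor(100.0, 10.0))  # -> 10.0
```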

Once I adjusted the user-limit-factor so a single user can take advantage of the whole capacity, the problem was solved! Thanks for your spark though!

@gnovak @tuxnet Would resource sharing still work if ACLs are configured for separate tenant queues? If ACLs are different for Q1 and Q2, will it still support elasticity and preemption?

Could you also please share the workload/application details that you used for these experiments? I am trying to run similar experiments to test elasticity and preemption of the Capacity Scheduler. I am using a simple Spark word count application on a large file, but I am not able to get a feel for resource sharing among queues with this application.

Thanks in advance.
