In our environment we use the Capacity Scheduler. Preemption is disabled.
I came across a situation I could not understand and would appreciate any insights from the forum members.
Let us suppose I have four queues in my cluster, q1 to q4.
The capacity allocation for q1, q2 and q3 is 30% minimum and 50% maximum.
The capacity for q4 is 10% minimum and 100% maximum.
We have jobs running on all queues, but only about 60% of the cluster capacity is currently being used.
One running job currently uses the q4 queue, and its usage is over capacity at 200%, i.e. 20% of the cluster capacity.
Another job is submitted to q4, but it stays in the ACCEPTED state.
So the question is: q4 has a max capacity of 100% and the cluster is only 60% used, yet the queue is not allowing another job to run.
This is the issue I saw.
So is the new job unable to run in q4 because q4 is already 'over capacity'?
Or is there some other reason?
If the former, can only the currently running jobs in q4 make use of the 100%?
Or can the capacity allocation for the queues be designed better?
Appreciate the insights.
There are additional parameters which limit the usage of queue resources by a single user or the application master.
They are yarn.scheduler.capacity.<queue-path>.user-limit-factor and yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.
The detailed documentation of these capacity-scheduler properties can be found at http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Queue_Prope...
|Property||Default||Behaviour and Recommendation|
|yarn.scheduler.capacity.<queue-path>.user-limit-factor||1||Multiple of the queue's configured capacity that a single user may consume. With the default of 1, a single user cannot go beyond the queue's configured capacity, even when the rest of the cluster is idle. If you are submitting jobs as the same user, it is recommended to increase the value above 1. For q4 (capacity 10%, maximum 100%), a single user can reach the full 100% only if this is set to 10. This is likely the reason the new job is not getting executed.|
|yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent||0.1||Limits the share of resources that can be used to run ApplicationMasters. For example, with the default value, only 10% of q4's max-capacity can be used by ApplicationMasters. When multiple applications are submitted to the same queue and this limit is reached, new applications stay in the ACCEPTED state and are not started, even if resources are free in the cluster.|
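To make the user-limit-factor arithmetic in the table concrete, here is a minimal Python sketch. It is a simplified model, not the scheduler's actual computation: it assumes a single active user and ignores minimum-user-limit-percent and other settings. The function name `user_ceiling_pct` is made up for illustration.

```python
def user_ceiling_pct(capacity_pct, user_limit_factor, max_capacity_pct):
    """Rough upper bound, in percent of the cluster, for one user in a leaf queue.

    Simplified assumption: the user may take capacity * user-limit-factor,
    but never more than the queue's maximum capacity.
    """
    return min(capacity_pct * user_limit_factor, max_capacity_pct)


# q4 from the question: capacity 10%, maximum capacity 100%.
for factor in (1, 3, 10):
    ceiling = user_ceiling_pct(10.0, factor, 100.0)
    print(f"user-limit-factor={factor}: one user can use up to {ceiling}% of the cluster")
```

Under this model, the default factor of 1 caps a single user at 10% of the cluster, and only a factor of 10 lets one user reach q4's 100% maximum.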
Sorry, I forgot to mention that the user limit factor for q4 is 3, so I think it should allow three jobs to run at the same time.
I checked the queue manager: the maximum AM resource for q4 is set to 'inherited%'.
An easy way to check the maximum AM resource is the RM UI page for queue q4: http://rm-host:8088/cluster/scheduler?openQueues=Queue:%20q4
Check the values for Max Application Master Resources and Used Application Master Resources. The other values shown there are also useful for identifying the limits configured on your queue.
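If the UI page is hard to reach, the same numbers should also be available from the ResourceManager REST API at `/ws/v1/cluster/scheduler`. Below is a rough Python sketch. The host `rm-host:8088` is the placeholder from the link above, and field names other than `queueName` (for example `AMResourceLimit`, `usedAMResource`) are assumptions that may vary by Hadoop version, so the search is written to depend on `queueName` only.

```python
import json
from urllib.request import urlopen


def fetch_scheduler(rm_url):
    """Download the scheduler description from the ResourceManager REST API."""
    with urlopen(rm_url.rstrip("/") + "/ws/v1/cluster/scheduler") as resp:
        return json.load(resp)


def find_queue(node, name):
    """Depth-first search for the dict whose "queueName" equals `name`."""
    if isinstance(node, dict):
        if node.get("queueName") == name:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_queue(child, name)
        if found is not None:
            return found
    return None


# Offline example with a hand-written payload (values taken from the question).
sample = {"scheduler": {"schedulerInfo": {"queueName": "root", "queues": {"queue": [
    {"queueName": "q1", "capacity": 30.0, "maxCapacity": 50.0},
    {"queueName": "q4", "capacity": 10.0, "maxCapacity": 100.0},
]}}}}
print(find_queue(sample, "q4"))
```

Against a live cluster you would call `find_queue(fetch_scheduler("http://rm-host:8088"), "q4")` and print whichever AM-related fields your version reports.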
That link is not working. Can you please check?
By the way, in my real cluster the queue in question has a minimum of 1% and a maximum of 90%.
Below are the configurations for the queue. Currently there are no jobs running in that queue:
Used Capacity:0.0%
Configured Capacity:1.0%
Configured Max Capacity:90.0%
Absolute Used Capacity:0.0%
Absolute Configured Capacity:1.0%
Absolute Configured Max Capacity:90.0%
Used Resources:<memory:0, vCores:0>
Configured Max Application Master Limit:20.0
Max Application Master Resources:<memory:235520, vCores:1>
Used Application Master Resources:<memory:0, vCores:0>
Max Application Master Resources Per User:<memory:235520, vCores:1>
The max app master resources memory is configured to 235520 MB (about 230 GB), but when I had the issue the single job was consuming only around 50 GB. Thanks.
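One thing worth keeping in mind when reading these numbers: the AM limit is compared against the memory held by ApplicationMaster containers, not against a job's total usage. A minimal sketch of that comparison follows, using the values pasted above. It is a simplification (the real scheduler also applies the per-user AM limit and may consider vCores), and the 2048 MB AM size is a made-up example value.

```python
def am_headroom_mb(max_am_mb, used_am_mb):
    """Memory still available for ApplicationMaster containers in the queue."""
    return max(max_am_mb - used_am_mb, 0)


def can_activate(new_am_mb, max_am_mb, used_am_mb):
    """True if a new AM of `new_am_mb` fits under the queue's AM limit."""
    return new_am_mb <= am_headroom_mb(max_am_mb, used_am_mb)


# Values from the queue stats: Max AM Resources 235520 MB, Used AM Resources 0 MB.
# 2048 MB is a hypothetical AM container size.
print(am_headroom_mb(235520, 0))
print(can_activate(2048, 235520, 0))
```

If this check passes for the queue while a job is stuck, the AM limit is probably not the blocker and the user limit is the next thing to look at.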
Can you click on the application in the RM UI, see what is reported under Diagnostics, and paste the content? It should state why the job is still in the ACCEPTED state.
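The diagnostics can also be pulled without the UI, from the ResourceManager REST API at `/ws/v1/cluster/apps`. Here is a small Python sketch, assuming the RM is reachable at the placeholder address `http://rm-host:8088` and that the response has the usual `apps` / `app` nesting. The helper names and the sample application id are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_url, queue, state="ACCEPTED"):
    """Build the REST URL listing applications in `queue` with the given state."""
    query = urlencode({"states": state, "queue": queue})
    return f"{rm_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


def diagnostics_by_app(payload):
    """Map application id -> diagnostics text; tolerates an empty result."""
    apps = (payload.get("apps") or {}).get("app") or []
    return {app["id"]: app.get("diagnostics", "") for app in apps}


def fetch_diagnostics(rm_url, queue):
    """Query a live ResourceManager (not exercised in the offline example)."""
    with urlopen(apps_url(rm_url, queue)) as resp:
        return diagnostics_by_app(json.load(resp))


# Offline example with a hand-written payload; the id and text are placeholders.
sample = {"apps": {"app": [
    {"id": "application_0000000000000_0001", "state": "ACCEPTED",
     "queue": "q4", "diagnostics": "placeholder diagnostics text"},
]}}
print(apps_url("http://rm-host:8088", "q4"))
print(diagnostics_by_app(sample))
```

The command line equivalent for a single job is `yarn application -status <application id>`.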