I am configuring Dynamic Resource Pooling in our cluster.
My use case requires two pools: Default and Priority.
Default is already in place.
My experiments and observations:
1. Left Default as it is and created a new pool, Priority, with min/max memory and cores. My cluster has 3.5 TB of memory and 720 cores; Priority's min/max memory was 100 GB/2 TB and min/max cores were 10/400.
When I run a job in the Default queue that uses the full 3.5 TB of memory and then submit a second job in the same queue, both start running. That's fine.
But when I run a job in the Priority queue that uses its full memory (the 2 TB max) and then submit another job in the Priority queue, the second goes into a pending state until the first one finishes. However, I was still able to submit and run jobs in the Default queue.
2. I removed the min/max settings from the Priority queue. The only difference between Default and Priority was the weight: Priority 3 (75%) and Default 1 (25%).
Now, when I run a job in Priority that uses the full memory and then submit another job, it runs. No problem.
But when I run a job in Default that uses up all 3.5 TB of memory and then submit a job to the Priority queue, it also runs, but with a minimum container memory of 3 GB, i.e. one container. If I submit one more, it again gets a single container using 3 GB of memory.
To see whether this could be sorted out, I increased the minimum memory of the Priority queue to 800 GB, but no luck: a job still gets only 3 GB if resources are already allocated to other pools.
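For reference, the two experiments correspond roughly to pool definitions like the following fair-scheduler.xml sketch. The element names are standard FairScheduler allocation-file syntax, but this is an illustrative reconstruction, not my actual generated file:

```xml
<allocations>
  <queue name="Priority">
    <!-- Experiment 2: only a weight difference between the pools -->
    <weight>3.0</weight>
    <!-- Experiment 1 used explicit limits instead (100 GB/2 TB, 10/400 cores):
    <minResources>102400 mb,10 vcores</minResources>
    <maxResources>2097152 mb,400 vcores</maxResources>
    -->
  </queue>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
</allocations>
```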
So what is the use of the weight here?
My expected result is:
The Priority queue should get proper memory (not 3 GB or one container) irrespective of the state of the cluster.
Note: when I say a job uses 3 GB, I mean its other containers go into a pending state.
The job/command I am using is :
spark-shell --master yarn-client --num-executors 75 --executor-memory 30g --queue Priority
75 and 30 are variables.
I have to assume this is a FairScheduler setup you are using. Is it possible to grab the FS XML file from an RM host and attach it so we have a good overview of what is configured? You can get it by going to an RM instance in CM via YARN -> Instances -> RM instance -> Processes and copying the fair-scheduler.xml.
You are running a Spark application; have you checked with a simple MR or Spark Pi example to make sure multiple applications can run in the same queue? They do not have to use huge containers.
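For example, a small Spark Pi run in the Priority queue would look something like this (the examples jar path varies by distribution, so adjust it to your install; the executor counts here are deliberately tiny):

```shell
# Submit two small SparkPi jobs to the Priority queue and see if both run
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-client --queue Priority \
  --num-executors 2 --executor-memory 1g \
  $SPARK_HOME/lib/spark-examples*.jar 100
```

If two of these run concurrently in the queue, the problem is specific to the large resource requests rather than the pool configuration itself.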
Also, check what you have set for the maximum running applications in the Priority queue. Is it higher than 1?
Are you using the same user for all these applications, and do you have a user limit set?
All of these answers should be available in the FS config, but they give you something to look at too.
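To illustrate, the two limits asked about above would appear in fair-scheduler.xml roughly like this. The element names are standard FairScheduler syntax; the values are placeholders, not your actual config:

```xml
<allocations>
  <queue name="Priority">
    <!-- If this is 1, a second app in the pool will always pend -->
    <maxRunningApps>10</maxRunningApps>
  </queue>
  <!-- Per-user cap that applies regardless of queue -->
  <user name="someuser">
    <maxRunningApps>5</maxRunningApps>
  </user>
  <!-- Default cap for users without an explicit entry -->
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
```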