Support Questions


Spark executors' relationship to YARN containers/queues

Contributor

We are using HDP 2.4 with the capacity scheduler.

Two queues: spark and default (each min-max: 50%-100%). The YARN container size is 1 GB (total YARN memory per node is ~5 GB). This is a small cluster with only 8 GB of RAM per machine.

Other teams run Spark jobs and we don't control what options they set, so we isolated those jobs in their own queue.

Problem: when a Spark job is running, we are not able to run any other job. Other jobs remain in the ACCEPTED state until the long-running job finishes. In our case the other jobs are Oozie jobs (the launcher plus the actual action), and they are very small. We are also launching them outside of the spark queue. The user limit factor is 1, with "fair" ordering, and preemption is enabled per the Hortonworks documentation. Here is the queue status:

Queue Name : default
	State : RUNNING
	Capacity : 50.0%
	Current Capacity : 82.6%
	Maximum Capacity : 100.0%
	Default Node Label expression : 
	Accessible Node Labels : *

Queue Name : spark
	State : RUNNING
	Capacity : 50.0%
	Current Capacity : 108.7%
	Maximum Capacity : 100.0%
	Default Node Label expression : 
	Accessible Node Labels : *

I would expect default to climb to near 100% and spark to eventually come back down toward 100%...

It is my understanding that these queues should be isolated. We can run jobs in the 'default' queue just fine when there isn't a spark job running.

Goal: we want our smaller jobs to finish even if there is a large spark job running. What am I missing here?

Thanks

1 ACCEPTED SOLUTION

Master Guru

It seems that something is blocking preemption. For example, if Spark jobs ask for large executors, "total_preemption_per_round" or another such property may be blocking their release. See here for details on how preemption works. Try reducing the spark queue's max capacity to 75% or so, or talk to the dev team and ask them to cooperate by using smaller executors, such as 1 GB. Also set the memory allocated to YARN containers to be a multiple of the minimum container size.
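A minimal sketch of what those suggestions could look like (the queue path, class name, jar and sizes below are illustrative assumptions, not values from this thread):

# Cap the spark queue below 100% so default always keeps some headroom
yarn.scheduler.capacity.root.spark.maximum-capacity=75

# Hypothetical example of a more modest Spark submission (1 GB executors)
spark-submit --master yarn --queue spark \
  --num-executors 4 --executor-memory 1g --executor-cores 1 \
  --class com.example.TeamJob team-job.jar

With a 1 GB minimum container size (yarn.scheduler.minimum-allocation-mb=1024), a 1 GB executor plus its default memory overhead (roughly 384 MB) gets normalized up to a 2 GB container, i.e. an exact multiple of the minimum allocation, which is the "multiple of min container size" point above.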


5 REPLIES

Contributor

Thanks Predrag,

We want it so that a Spark job can't take down the cluster. It seems Spark ignores the Ambari queue properties: even with max capacity for the spark queue set to 50%, if a developer runs a job with many executors and a lot of memory per executor, our HDP cluster becomes unstable. This shouldn't be possible (we're coming from Cloudera and fair scheduling). We can work with the Spark team, but it seems we are missing something fundamental, since no single job or person should be able to lock up resources like this. YARN memory goes to 100%+.
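To make that failure mode concrete, a submission along these lines (numbers and names invented purely for illustration, not an actual job from this thread) can claim most of a small cluster for the lifetime of the job when the queue's max capacity is 100%:

# Hypothetical over-sized submission, for illustration only
spark-submit --master yarn --queue spark \
  --num-executors 10 --executor-memory 3g --executor-cores 2 \
  --class com.example.LongRunningJob long-running-job.jar

On nodes with only ~5 GB of YARN memory, each such executor occupies most of a node, so a handful of them can tie up the whole cluster until the job finishes.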

Super Collaborator

There are multiple things that may be at play here, given your HDP version.

To be clear: you have set up a YARN queue for Spark with 50% capacity, but Spark jobs can take up more than that (up to 100%), and since these are long-running executors, the cluster is locked up until the job finishes. Is that correct?

If yes, then let's see if the following helps. This might be verbose in order to help other users (in case you already know about these things :)).

1) YARN schedulers, fair or capacity, will allow jobs to go up to max capacity if resources are available. Given that your spark queue is configured with max capacity = 100%, this is allowed, which explains why Spark jobs can take over your cluster (the queue layout is sketched in capacity-scheduler terms after point 2). The difference between fair and capacity is that, for concurrent jobs asking for resources at the same time, capacity will prefer the first job while fair will share across all jobs. However, if a job has already taken over the cluster, neither will be able to give other queues resources until the job itself returns them, assuming preemption is not enabled.

2) YARN schedulers, fair or capacity, support cross-queue preemption. So if queue 2 is over its capacity and queue 1 needs resources, then resources between capacity and max capacity will be preempted from queue 2 and given to queue 1. Have you enabled preemption in the scheduler for your queues? That should trigger preemption of excess capacity from the spark queue to other queues when needed.
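For reference, the queue layout described in the question would look roughly like this in capacity-scheduler properties (queue paths root.default and root.spark are assumptions; the percentages come from the original post):

# Queue paths assumed; adjust to your actual hierarchy
yarn.scheduler.capacity.root.queues=default,spark
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.spark.capacity=50
yarn.scheduler.capacity.root.spark.maximum-capacity=100

The maximum-capacity=100 on the spark queue is what lets a long-running Spark job grow into the whole cluster while default is idle (point 1), and preemption (point 2) is what is supposed to claw that excess back when default needs it.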

IIRC, this is how it should behave regardless of fair vs. capacity scheduling when new small jobs come in after an existing job has taken over the cluster. Perhaps you could compare your previous fair-scheduler settings with your new capacity-scheduler settings to check whether preemption was enabled in the former but not in the latter.
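For that comparison, CapacityScheduler preemption is normally driven by yarn-site.xml properties along these lines (shown for Hadoop 2.x; exact names and defaults can differ between releases, so treat this as a sketch to check against rather than a definitive list):

# Turn the preemption monitor on and register the capacity preemption policy
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy

# Tuning knobs for how aggressively excess capacity is reclaimed
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1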

Contributor

Thanks @bikas

So if the spark queue capacity is 40%, max capacity is 50%, and the user limit factor is 1, then in theory, regardless of what options are set during spark-submit, the Spark job should never use more than 50% of resources... correct?
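For reference, those three settings would be expressed on the spark queue roughly as follows (queue path root.spark assumed, not taken from this thread):

# Queue path assumed; adjust to your hierarchy
yarn.scheduler.capacity.root.spark.capacity=40
yarn.scheduler.capacity.root.spark.maximum-capacity=50
yarn.scheduler.capacity.root.spark.user-limit-factor=1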

Preemption is enabled; here is the config:

yarn.resourcemanager.monitor.enable=true
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.monitor_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy


The ordering policy for both queues is 'fair', and Enable Size Based Weight Ordering is not enabled.

I will try a few more tests and make sure it really is using resources it shouldn't. If something else seems off in our config, let me know...

Super Collaborator

Yes. If the max capacity for a queue is 50%, it will not be allocated more than 50% of resources even if the rest of the cluster is idle. Obviously this can waste free capacity, which is why max capacity is often left at 100%, and why preemption then becomes important for handing resources back to other queues in a timely manner.
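As a rough illustration of that trade-off (assuming, say, five worker nodes at ~5 GB of YARN memory each; the actual node count is not stated in this thread):

total YARN memory                ~ 5 nodes x 5 GB = 25 GB
spark queue, max capacity 50%    -> hard ceiling of ~12.5 GB, even when default is idle
spark queue, max capacity 100%   -> can grow to ~25 GB, but must be preempted back when default needs its share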

Your configs look OK at first glance, but you should check here and here about the configs. You may have to play around with them before you get the response times you want from preemption. If you do not see preemption happening even though it is properly configured, then you should open a support case (if you have a support relationship).