<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark - YARN Capacity Scheduler in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167966#M130296</link>
    <description>&lt;P&gt;I think it's spark.dynamicAllocation.initialExecutors that you can set per job. Try putting it in a properties file and passing it with --properties-file. I haven't tried this myself, so let me know how it works.&lt;/P&gt;</description>
    <pubDate>Thu, 26 May 2016 04:41:34 GMT</pubDate>
    <dc:creator>ravi1</dc:creator>
    <dc:date>2016-05-26T04:41:34Z</dc:date>
    <item>
      <title>Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167959#M130289</link>
      <description>&lt;P&gt;Can we configure the Capacity Scheduler so that a Spark job only runs when it can procure enough resources?&lt;/P&gt;&lt;P&gt;In the current FIFO setup a Spark job will start running if it can get only a few of the required executors, but it will then fail because it couldn't acquire enough resources.&lt;/P&gt;&lt;P&gt;I would like the Spark job to start only when it can procure all the required resources.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 02:09:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167959#M130289</guid>
      <dc:creator>nismaily</dc:creator>
      <dc:date>2016-05-26T02:09:42Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167960#M130290</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/5185/nismaily.html" nodeid="5185"&gt;@Nasheb Ismaily&lt;/A&gt;&lt;P&gt; You might need to set minimum-user-limit-percent (say, 30%):&lt;/P&gt;&lt;PRE&gt;yarn.scheduler.capacity.root.support.services.minimum-user-limit-percent&lt;/PRE&gt;&lt;P&gt;Unless 30% of the queue capacity is available, the job will not start.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 02:25:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167960#M130290</guid>
      <dc:creator>yjagadeesan</dc:creator>
      <dc:date>2016-05-26T02:25:43Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167961#M130291</link>
      <description>&lt;P&gt;Thank you, I'll try this out.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 02:27:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167961#M130291</guid>
      <dc:creator>nismaily</dc:creator>
      <dc:date>2016-05-26T02:27:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167962#M130292</link>
      <description>&lt;P&gt;This only works at the user level, not at the job level. So if the user has other jobs and is already getting his percentage of the queue, the Spark job will start even before it can get the resources it needs.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 02:31:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167962#M130292</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-05-26T02:31:18Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167963#M130293</link>
      <description>&lt;P&gt;If you are not using dynamic allocation, the submitted job will not start until it gets all the resources. You are asking for N executors, so YARN will not let the job proceed until it gets all of them.&lt;/P&gt;&lt;P&gt;If you are using dynamic allocation, then setting spark.dynamicAllocation.minExecutors to a higher value means the job gets scheduled only once minExecutors can be met.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 04:04:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167963#M130293</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-05-26T04:04:03Z</dc:date>
    </item>
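As a sketch of the dynamic-allocation case described above, the relevant spark-defaults.conf lines would look roughly like this (the minExecutors value of 10 is illustrative, not from the thread; the lines are written to a scratch file here rather than the real config):

```shell
# Global spark-defaults.conf lines for the dynamic-allocation case.
# Values are illustrative; written to a scratch file for demonstration.
printf '%s\n' \
  'spark.dynamicAllocation.enabled      true' \
  'spark.shuffle.service.enabled        true' \
  'spark.dynamicAllocation.minExecutors 10' \
  > spark-defaults-sketch.conf
cat spark-defaults-sketch.conf
```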
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167964#M130294</link>
      <description>&lt;P&gt;Thanks Ravi, this is very close to what I need.&lt;/P&gt;&lt;P&gt;One question: spark.dynamicAllocation.minExecutors seems to be a global property in spark-defaults.&lt;/P&gt;&lt;P&gt;Is there a way to set this property on a job-by-job basis?&lt;/P&gt;&lt;P&gt;Spark job1 -&amp;gt; min executors 8&lt;/P&gt;&lt;P&gt;Spark job2 -&amp;gt; min executors 5&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 04:19:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167964#M130294</guid>
      <dc:creator>nismaily</dc:creator>
      <dc:date>2016-05-26T04:19:04Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167965#M130295</link>
      <description>&lt;P&gt;I agree &lt;A rel="user" href="https://community.cloudera.com/users/216/ravi.html" nodeid="216"&gt;@Ravi Mutyala&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 04:20:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167965#M130295</guid>
      <dc:creator>yjagadeesan</dc:creator>
      <dc:date>2016-05-26T04:20:15Z</dc:date>
    </item>
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167966#M130296</link>
      <description>&lt;P&gt;I think it's spark.dynamicAllocation.initialExecutors that you can set per job. Try putting it in a properties file and passing it with --properties-file. I haven't tried this myself, so let me know how it works.&lt;/P&gt;</description>
      <pubDate>Thu, 26 May 2016 04:41:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167966#M130296</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-05-26T04:41:34Z</dc:date>
    </item>
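A minimal sketch of the per-job properties-file approach suggested above. The file name job1.conf and the executor count of 8 are made up for illustration (8 matches the job1 example from the question); the spark-submit line is shown only as a comment since it depends on a live cluster:

```shell
# Write a per-job Spark properties file (illustrative values).
printf '%s\n' \
  'spark.dynamicAllocation.enabled          true' \
  'spark.dynamicAllocation.initialExecutors 8' \
  'spark.dynamicAllocation.minExecutors     8' \
  > job1.conf

# The job would then be submitted with something like (not executed here):
#   spark-submit --properties-file job1.conf ...
cat job1.conf
```

A second job could get its own file (e.g. job2.conf with minExecutors 5), giving different bounds per submission without touching the global spark-defaults.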
    <item>
      <title>Re: Spark - YARN Capacity Scheduler</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167967#M130297</link>
      <description>&lt;P&gt;Thanks Ravi,&lt;/P&gt;&lt;P&gt;I had to:&lt;/P&gt;&lt;P&gt;1) Copy the Spark shuffle JARs to the NodeManager classpath on all nodes&lt;/P&gt;&lt;P&gt;2) Add spark_shuffle to yarn.nodemanager.aux-services and set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService in yarn-site.xml (via Ambari)&lt;/P&gt;&lt;P&gt;3) Restart all NodeManagers&lt;/P&gt;&lt;P&gt;4) Add the following to spark-defaults.conf:&lt;/P&gt;&lt;P&gt;spark.dynamicAllocation.enabled true&lt;/P&gt;&lt;P&gt;spark.shuffle.service.enabled   true&lt;/P&gt;&lt;P&gt;5) Set these parameters on a per-job basis:&lt;/P&gt;&lt;P&gt;spark.dynamicAllocation.initialExecutors=#&lt;/P&gt;&lt;P&gt;spark.dynamicAllocation.minExecutors=#&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jun 2016 07:48:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-YARN-Capacity-Scheduler/m-p/167967#M130297</guid>
      <dc:creator>nismaily</dc:creator>
      <dc:date>2016-06-01T07:48:58Z</dc:date>
    </item>
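The yarn-site.xml changes from step 2 of the recipe above can be sketched as the two key/value pairs involved. This is only an illustration written to a scratch file; in practice they are set via Ambari, and the pre-existing mapreduce_shuffle entry in aux-services is an assumption, not something stated in the thread:

```shell
# The two yarn-site.xml settings from step 2, as key=value pairs.
# Sketch only: the mapreduce_shuffle entry is assumed to already exist.
printf '%s\n' \
  'yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle' \
  'yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService' \
  > yarn-shuffle-sketch.props
cat yarn-shuffle-sketch.props
```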
  </channel>
</rss>

