Hello, I hope I can get some advice on an issue I have been struggling with for a while.
I have a very resource-hungry application that runs for several hours before completion. It runs with Spark on YARN.
Ideally, I would like it to consume as many resources as possible, but release some when other applications are submitted to the cluster.
I defined two queues, one with 67% capacity and one with 33%. I enabled preemption and gave the higher queue a 10-second preemption timeout. My application is submitted to the lower queue and starts running, but when I submit another application to the higher queue, it gets stuck in ACCEPTED status and is not processed until the first application finishes. I was expecting the first application to release some resources and let the second one start. What am I missing?
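Concretely, my setup is along these lines; the queue names (high, low) and exact values shown here are illustrative simplifications of my actual configuration:

```xml
<!-- capacity-scheduler.xml: two queues at 67% / 33% -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>high,low</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.high.capacity</name>
  <value>67</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.low.capacity</name>
  <value>33</value>
</property>
<!-- allow the low queue to grow beyond its share when the cluster is free -->
<property>
  <name>yarn.scheduler.capacity.root.low.maximum-capacity</name>
  <value>100</value>
</property>

<!-- yarn-site.xml: enable the preemption monitor -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<!-- roughly the 10-second timeout before containers are killed -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>10000</value>
</property>
```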
This sounds a lot like a job for dynamic executor allocation on YARN. It can shut down executors and add them back in response to demand, and it doesn't even involve queues.
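A minimal sketch of what enabling it looks like in spark-defaults.conf, assuming Spark 1.2+ on YARN (the executor counts and timeout here are just placeholders to tune for your workload):

```properties
# spark-defaults.conf: dynamic executor allocation on YARN
spark.dynamicAllocation.enabled             true
# requires the external shuffle service on each NodeManager
spark.shuffle.service.enabled               true
# keep a small floor of executors when idle
spark.dynamicAllocation.minExecutors        2
# grow up to this many executors when the cluster is free
spark.dynamicAllocation.maxExecutors        50
# release executors that have been idle for 60 seconds
spark.dynamicAllocation.executorIdleTimeout 60
```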
Could you elaborate a bit more? Thanks!
I tried enabling spark.dynamicAllocation.enabled and spark.shuffle.service.enabled and followed the steps in https://spark.apache.org/docs/1.2.0/job-scheduling.html, but I could not find a jar file called spark-<version>-yarn-shuffle.jar in my Cloudera distribution. I am using CDH 5.3.0.
When I tried running a job anyway, I got this error:
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
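From what I can tell, this exception means the NodeManagers don't have a spark_shuffle auxiliary service registered. Once the shuffle jar is on the NodeManager classpath, the registration in yarn-site.xml apparently looks like this (untested on my side, since the jar is missing from CDH 5.3; the NodeManagers need a restart afterwards):

```xml
<!-- yarn-site.xml on every NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```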
I saw that CDH 5.4 includes the shuffle service mentioned above. I will try it out once I get my hands on CDH 5.4.