Support Questions

Find answers, ask questions, and share your expertise

Mysteriously losing spark executors with many tasks still outstanding

avatar
Explorer

I asked this question on SO, but Cloudera forums may be a better place...

 

https://stackoverflow.com/questions/44143230/losing-spark-executors-with-many-tasks-outstanding

 

Why am I losing my executors and how can I stop it?

3 REPLIES 3

avatar
Champion
Best practice is to include some information from the link provide as it may not work or not exist in the future.

I have only ever seen this type of behavior when dynamic allocation is enabled and attributed it to that. I believe that setting the number of executors should override that and not use dynamic allocation. I may be wrong in that or it may be misbehaving.

Have you tried turning off dynamic allocation altogether? Are other jobs being launched around the time of the start of the decay?

avatar
Explorer

@mbigelowFair point. I did a write up specifically for this forum, but apparently my session had expired so hitting post lost it all, so I didn't want to re-write it; my bad.

 

I did try setting `spark.dynamicAllocation.enabled` to false, which still didn't change the decay issue. I even upped `spark.dynamicAllocation.executorIdleTimeout` to 5 minutes (from 1 minute) in case that was the problem, but it didn't seem to have an effect. My main theory right now is that because the data I'm accessing is on HDFS and is minimally replicated, _maybe_ the executors are dropped because they supposedly don't think they can work... I'm going to try building an external table in a similar vein to my HDFS one using my same data in S3 parquet.

 

Here are the details from SO:

 

Whether I use dynamic allocation or explicitly specify executors (16) and executor cores (8), I have been losing executors even though the tasks outstanding are well beyond the current number of executors.

For example, I have a job (Spark SQL) running with over 27,000 tasks and 14,000 of them were complete, but executors "decayed" from 128 down to as few as 16 with thousands of tasks still outstanding. The log doesn't note any errors/exceptions precipitating these lost executors.

It is a Cloudera CDH 5.10 cluster running on AWS EC2 instances with 136 CPU cores and Spark 2.1.0 (from Cloudera).

17/05/23 18:54:17 INFO yarn.YarnAllocator: Driver requested a total number of 91 executor(s).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 91 executors.

It's a slow decay where every minute or so more executors are removed.

Some potentially relevant configuration options:

spark.dynamicAllocation.maxExecutors = 136
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.initialExecutors = 1
yarn.nodemanager.resource.cpu-vcores = 8
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.increment-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 8

Why are the executors decaying away and how can I prevent it?

avatar
Cloudera Employee

Sorry this is nearly a year later, but the behavior you're seeing is likely that spark.executor.instances and --num-executors no longer disables dynamic allocation in 2.X, we now have to explicitly set spark.dynamicAllocation.enabled to false otherwise it just uses the value in the aforementioned properties as the initial executors but still continues to use dynamic allocation to scale the count up and down.

 

That may or may not explain your situation as you mentioned playing with some of those properties.  Additionally the remaining ~13,000 tasks you describe doesn't necessarily mean that there are 13,000 pending tasks, a large portion of those could be for future stages that depend on the current stage and when you're seeing the number of executors reduced it's likely that the current stage was not using all the available executors and they reached the idle limit and were released.  You will want to explicitly disable dynamic allocation if you want a static number of executors, and likely want to review if there's a low task count at the time of "decay" and look at the data to figure out why which could potentially be resolved by simply performing a repartition of the RDD/DS/DF in the stage that has the low level of partitions. 

 

Alternatively there could be resource management configuration or perhaps an actual bug related to the behavior but I would start with the assumption that it's related to the config, data, or partitioning.