@mbigelow Fair point. I did a write-up specifically for this forum, but apparently my session had expired, so hitting post lost it all, and I didn't want to re-write it; my bad.
I did try setting `spark.dynamicAllocation.enabled` to false, which didn't change the decay issue. I even upped `spark.dynamicAllocation.executorIdleTimeout` from 1 minute to 5 minutes in case that was the problem, but it didn't seem to have any effect. My main theory right now is that because the data I'm accessing is on HDFS and is minimally replicated, _maybe_ the executors are dropped because they have no local data to work on... I'm going to try building an external table, in a similar vein to my HDFS one, over the same data stored as Parquet in S3.
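For reference, this is roughly how I'm passing those settings at submit time (the jar name is a placeholder, and the executor counts are the explicit values mentioned below):

```shell
# Sketch of the configuration I tried: dynamic allocation disabled,
# idle timeout raised to 5 minutes, executors/cores pinned explicitly.
# my-app.jar stands in for my actual application.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.dynamicAllocation.executorIdleTimeout=300s \
  --num-executors 16 \
  --executor-cores 8 \
  my-app.jar
```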
Here are the details from SO:
Whether I use dynamic allocation or explicitly specify executors (16) and executor cores (8), I have been losing executors even though the number of outstanding tasks is well beyond the current number of executors.
For example, I have a job (Spark SQL) running with over 27,000 tasks, 14,000 of them complete, but executors "decayed" from 128 down to as few as 16 with thousands of tasks still outstanding. The log doesn't show any errors or exceptions preceding the lost executors.
It is a Cloudera CDH 5.10 cluster running on AWS EC2 instances with 136 CPU cores and Spark 2.1.0 (from Cloudera).
```
17/05/23 18:54:17 INFO yarn.YarnAllocator: Driver requested a total number of 91 executor(s).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 91 executors.
```
It's a slow decay where every minute or so more executors are removed.
Some potentially relevant configuration options:
```
spark.dynamicAllocation.maxExecutors = 136
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.initialExecutors = 1
yarn.nodemanager.resource.cpu-vcores = 8
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.increment-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 8
```
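As a sanity check on those numbers, here is the rough capacity arithmetic (assumptions only: it uses the 136 total vcores and 8 executor cores from above, and ignores node boundaries, the driver, and memory overhead):

```python
# Rough upper-bound arithmetic, not a measurement.
# Assumes 136 total vcores cluster-wide and 8 cores per executor.
total_vcores = 136
executor_cores = 8            # explicit --executor-cores value
max_executors_cfg = 136       # spark.dynamicAllocation.maxExecutors

# YARN can only satisfy as many 8-core containers as vcores allow,
# so the configured maximum of 136 executors is unreachable here.
capacity = total_vcores // executor_cores
effective_max = min(max_executors_cfg, capacity)
print(effective_max)  # 17 with these numbers
```

So with 8-core executors the cluster tops out well below `maxExecutors`; the 128-executor peak I saw presumably came from smaller containers under dynamic allocation.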
Why are the executors decaying away and how can I prevent it?