I am running a Hive on Spark job which consumes a large amount of data from a Hive external table.
When I run my query, it spawns ~990,000 tasks.
The job is running in a YARN queue with 2800 CPUs available.
I am running 700 executors with 4 cores each.
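For reference, the executor sizing above comes from settings roughly like the following in my Hive session (these are the standard Spark-on-YARN properties; memory values are omitted since they aren't relevant to the question):

```sql
-- Run the query on the Spark engine with 700 executors x 4 cores = 2800 CPUs
SET hive.execution.engine=spark;
SET spark.executor.instances=700;
SET spark.executor.cores=4;
```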
I've observed the same behavior consistently each time I run the job. Everything starts out well: after ~1.5 hours, the job is 80% done (800k/990k tasks complete). After that, however, Spark stops using all of its CPUs: it slowly ramps down from 2800 all the way to double digits. As a result, the last 190k tasks of this query take significantly longer than the previous 800k.
During the ~4-5 hours it takes to run those last 190k tasks, usage occasionally spikes back up to the full 2800 CPUs for maybe 60 seconds, but it rapidly drops back down to double digits.
I'm not sure why Hive/Spark is only using a fraction of the CPUs available to it for the last 190k tasks of this job. I could understand that, as the job gets very close to completion, there wouldn't be enough remaining tasks to parallelize across thousands of CPUs. But with 190k still to go, that seems far too early.
Things I've checked:
No other query is pre-empting its YARN resources. (In addition, the fact that this nosedive consistently occurs at the 80% mark makes pre-emption seem unlikely: if the job were being pre-empted, I would expect it to gain and lose resources sporadically, not drop off predictably at the same point every run.)
This issue occurs whether dynamic allocation is enabled or disabled. If disabled, the query keeps hold of all the CPUs; it just chooses not to use them. If enabled, the query spins down executors as it decides it no longer needs them.
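Concretely, the two configurations I compared look roughly like this (illustrative, not my exhaustive config; the shuffle service line is the usual YARN prerequisite for dynamic allocation):

```sql
-- Run A: dynamic allocation disabled, fixed pool of 700 executors
SET spark.dynamicAllocation.enabled=false;
SET spark.executor.instances=700;

-- Run B: dynamic allocation enabled, capped at the same size
SET spark.dynamicAllocation.enabled=true;
SET spark.shuffle.service.enabled=true;
SET spark.dynamicAllocation.maxExecutors=700;
```

Both runs show the same ramp-down at the 80% mark; they only differ in whether the idle CPUs stay allocated to the job.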
I could understand some tasks taking longer than others if there were significant HDFS skew across the data nodes. But that wouldn't explain why so many CPUs sit idle while 190k tasks have yet to start.
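If it helps, the skew check I'd know how to do is along these lines (the path is a placeholder for my external table's HDFS location, not a real path from my cluster):

```shell
# List files, their blocks, and which datanodes hold each replica,
# to see whether the table's blocks are concentrated on a few nodes
hdfs fsck /path/to/external/table -files -blocks -locations
```

I haven't seen anything in that output that would account for this pattern, but I may be looking at the wrong thing.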