Thousands of tasks in ShuffleMapStage

I'm trying to figure out why one of my regular Spark jobs (Spark 2.2, using RDDs) is taking so long. I may have unrealistic expectations, so I wanted to check whether any of the messages below suggest room for improvement or tuning.

We have a ten-node cluster with 9 worker nodes, each with 128 GB RAM and 64 CPUs, so there is plenty of capacity. We currently submit the job with 8 executors and 3 CPUs per executor. I want to understand where the bottleneck is before I make any adjustments.
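For scale, here is a rough back-of-the-envelope calculation of how much of the cluster that submit configuration can use at once. It is only a sketch: it assumes one core per task (the default for spark.task.cpus) and ignores memory limits and any other jobs sharing the cluster. Under that assumption the job gets 24 concurrent task slots out of 576 cores.

```python
# Figures taken from the post above. One core per task is an assumption
# (the default value of spark.task.cpus is 1).
worker_nodes = 9
cores_per_node = 64
num_executors = 8
cores_per_executor = 3
cpus_per_task = 1  # assumed default

# Total cores available across the worker nodes.
cluster_cores = worker_nodes * cores_per_node

# Number of tasks that can run at the same time with this configuration.
task_slots = (num_executors * cores_per_executor) // cpus_per_task

# Fraction of the cluster's cores the job can occupy.
share_used = task_slots / cluster_cores

print(f"cluster cores: {cluster_cores}")
print(f"concurrent task slots: {task_slots}")
print(f"share of cluster cores used: {share_used:.1%}")
```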

The majority of the time in the slow spark job is between these entries:

19/07/03 04:00:39 INFO DAGScheduler: Got job 2 (saveAsTable at with 200 output partitions
19/07/03 04:14:06 INFO DAGScheduler: Job 2 finished: saveAsTable at, took 806.442023 s

In the same log, when I drill into what job 2 is up to, I can see:

19/07/03 04:04:53 INFO DAGScheduler: Submitting 40000 missing tasks from ShuffleMapStage 11 (MapPartitionsRDD[39] at toJavaRDD at (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/03 04:04:53 INFO YarnClusterScheduler: Adding task set 11.0 with 40000 tasks 

It looks like this is the problem? For the next 10 minutes or so I see 40000 lines that look like this:

Starting task 12345.0 in stage 11.0 (TID 2613,, executor 7, partition 38570, NODE_LOCAL, 5254 bytes)

These tasks all run on nodes 1, 3 and 5, and are all NODE_LOCAL or PROCESS_LOCAL (not sure if that matters?).
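As a rough sanity check on those numbers, the sketch below estimates how many scheduling waves 40000 tasks need and how short each task must be on average. It uses only the timestamps and counts from the log lines above, and it assumes 24 task slots (8 executors x 3 cores, one core per task). The window runs from the submission of stage 11 to the end of job 2, so it also covers anything that ran after stage 11. That makes the per-task figure an upper bound, not a measurement.

```python
import math
from datetime import datetime

# Timestamps copied from the DAGScheduler log lines in the post.
fmt = "%y/%m/%d %H:%M:%S"
stage_submitted = datetime.strptime("19/07/03 04:04:53", fmt)
job_finished = datetime.strptime("19/07/03 04:14:06", fmt)

total_tasks = 40000
task_slots = 8 * 3  # executors * cores per executor, assuming 1 core per task

# Wall-clock window between stage 11 being submitted and job 2 finishing.
elapsed_s = (job_finished - stage_submitted).total_seconds()

# Number of rounds needed if every slot runs one task per round.
waves = math.ceil(total_tasks / task_slots)

# Overall task throughput across the whole window.
tasks_per_second = total_tasks / elapsed_s

# Average slot time available per task (upper bound on mean task duration).
avg_slot_time_s = elapsed_s * task_slots / total_tasks

print(f"elapsed: {elapsed_s:.0f} s")
print(f"waves of tasks: {waves}")
print(f"throughput: {tasks_per_second:.1f} tasks/s")
print(f"average slot time per task: {avg_slot_time_s:.2f} s")
```

If the average really is well under a second per task, the individual tasks are doing very little work each, which may mean per-task scheduling overhead accounts for a noticeable share of the stage time.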

Any ideas?