Trying to figure out why one of my regular Spark jobs (Spark 2.2, using RDDs) is taking so very long. I may have unrealistic expectations, so I wanted to check whether any of the log messages below suggest room for improvement / tuning.
We have a ten-node cluster with 9 worker nodes, each with 128 GB RAM and 64 CPUs, so there is plenty of capacity. We currently submit the job with 8 executors and 3 cores per executor. I'm trying to understand where the bottleneck is before I make any adjustments.
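For reference, the submit command looks roughly like this. This is a sketch, not the real invocation: the class name, jar path, and executor memory are placeholders, since only the executor count and cores per executor are stated above.

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 8 \
  --executor-cores 3 \
  --executor-memory 16g \
  --class com.example.TestJob \
  test_job.jar
```

With 8 executors at 3 cores each, the job can run at most 24 tasks concurrently, which is a small fraction of the 9 × 64 cores available on the workers.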
The majority of the time in the slow Spark job is spent between these two entries:
19/07/03 04:00:39 INFO DAGScheduler: Got job 2 (saveAsTable at test_job.java:3000) with 200 output partitions
19/07/03 04:14:06 INFO DAGScheduler: Job 2 finished: saveAsTable at test_job.java:3000, took 806.442023 s
In the same log, when I drill into what job 2 is up to, I can see:
19/07/03 04:04:53 INFO DAGScheduler: Submitting 40000 missing tasks from ShuffleMapStage 11 (MapPartitionsRDD at toJavaRDD at test_job.java:980) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/03 04:04:53 INFO YarnClusterScheduler: Adding task set 11.0 with 40000 tasks
It looks like this is the problem: for the next 10 minutes or so I see 40000 lines that look like this: