I am getting desperate here! My Spark2 jobs run for hours and then get stuck!
I have a 4-node cluster, each node with 16 GB RAM and 8 cores, running HDP 2.6, Spark 2.1 and Zeppelin 0.7.
I have:
- spark.executor.instances = 11
- spark.executor.cores = 2
- spark.executor.memory = 4G
- yarn.nodemanager.resource.memory-mb = 14336
- yarn.nodemanager.resource.cpu-vcores = 7
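To be explicit about which knobs I mean, the Spark-side settings expressed in code would look roughly like this (illustration only; the app name is just a placeholder, and the two YARN properties are cluster-side settings in yarn-site.xml, not Spark configs):

```scala
import org.apache.spark.sql.SparkSession

// Illustration only: the same executor settings as listed above, expressed in code.
// "predictions-insert" is just a placeholder app name.
val spark = SparkSession.builder()
  .appName("predictions-insert")
  .config("spark.executor.instances", "11")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .enableHiveSupport()
  .getOrCreate()
```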
Via Zeppelin (in the same notebook) I do an INSERT into a Hive table:
- dfPredictions.write.mode(SaveMode.Append).insertInto("default.predictions")
for a 50-column table with about 12 million records.
The job gets split into 3 stages with 75, 75 and 200 tasks. The two 75-task stages get stuck at tasks 73 and 74, and garbage collection then runs for hours. Any idea what I can try?
EDIT: I have not yet looked at tweaking the partitioning. Can anyone give me pointers on how to do that, please? My untested guess is the sketch below.
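For example, is something along these lines what is meant by adjusting the partitioning before the write? This is only a guess on my part, and 200 is an arbitrary starting number, not something I have validated:

```scala
import org.apache.spark.sql.SaveMode

// Untested sketch: repartition dfPredictions before the insert so the write
// stage runs as more, smaller tasks instead of a handful of large ones.
// 200 is only a starting guess; spark.sql.shuffle.partitions could also be
// raised to affect the shuffle stages themselves.
val repartitioned = dfPredictions.repartition(200)

repartitioned.write
  .mode(SaveMode.Append)
  .insertInto("default.predictions")
```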