Support Questions

Why is my Spark job stuck?

Expert Contributor

I'm getting desperate here! My Spark2 jobs run for hours and then get stuck!

I have a 4-node cluster, each node with 16 GB RAM and 8 cores. I run HDP 2.6, Spark 2.1 and Zeppelin 0.7.

I have:

  1. spark.executor.instances=11
  2. spark.executor.cores=2
  3. spark.executor.memory=4G
  4. yarn.nodemanager.resource.memory-mb=14336
  5. yarn.nodemanager.resource.cpu-vcores=7
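For context, a quick back-of-the-envelope check of how many such executors one node can host (the max(384 MB, 10%) memoryOverhead rule is the Spark-on-YARN default, assumed here since the question doesn't override it):

```scala
// Sanity check: how many 4G/2-core executors fit per node under the YARN limits above.
// Assumes the default spark.yarn.executor.memoryOverhead = max(384 MB, 10% of executor memory).
val nodes        = 4
val nodeMemMb    = 14336 // yarn.nodemanager.resource.memory-mb
val nodeVcores   = 7     // yarn.nodemanager.resource.cpu-vcores
val execMemMb    = 4096  // spark.executor.memory = 4G
val coresPerExec = 2     // spark.executor.cores

val overheadMb  = math.max(384, (0.10 * execMemMb).toInt) // 409 MB
val containerMb = execMemMb + overheadMb                  // YARN container size per executor
val byMem       = nodeMemMb / containerMb                 // executors that fit by memory
val byCores     = nodeVcores / coresPerExec               // executors that fit by cores
val perNode     = math.min(byMem, byCores)
println(s"$perNode executors per node, ${nodes * perNode} cluster-wide")
```

This works out to 3 executors per node and 12 cluster-wide, so 11 executors plus a driver/AM container just fit; the requested resources are not obviously over-subscribed.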

Via Zeppelin (same notebook) I do an INSERT into a Hive table:

  1. dfPredictions.write.mode(SaveMode.Append).insertInto("default.predictions")

for a 50-column table with about 12 million records.

This gets split into 3 stages of 75, 75 and 200 tasks. The two 75-task stages get stuck at task 73 or 74, and garbage collection then runs for hours. Any idea what I can try?

EDIT: I have not looked at tweaking partitions, can anyone give me pointers on how to do that, please?
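(One common rule of thumb is to size partitions toward roughly 128 MB of data each. A rough sketch of estimating a partition count that way; the per-row byte figure is purely a guess for illustration, not measured from this table:)

```scala
// Rough partition-count estimate: aim for ~128 MB of data per task.
// approxBytesPerRow is an assumption (50 columns at ~10 bytes each), not a measurement.
val rows                 = 12000000L
val approxBytesPerRow    = 500L
val targetPartitionBytes = 128L * 1024 * 1024
val numPartitions = math.max(1L, rows * approxBytesPerRow / targetPartitionBytes).toInt
println(numPartitions)
```

With a live SparkSession, an estimate like this could feed into `dfPredictions.repartition(numPartitions)` before the `insertInto`, or into lowering `spark.sql.shuffle.partitions` from its default of 200.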

1 REPLY

Re: Why is my Spark job stuck?

Expert Contributor

Check whether SPARK_HOME in the Zeppelin interpreter settings points to the correct Spark installation.

Is it set to the value below?

SPARK_HOME=/usr/hdp/current/spark2-client/

Where are you setting the Spark properties, in spark-env.sh or via the Zeppelin interpreter settings? Check this thread:

https://issues.apache.org/jira/browse/ZEPPELIN-295

Try setting spark.driver.memory=4G and spark.driver.cores=2.

Check spark.memory.fraction (if it's set to 0.75, reduce it to 0.6; see https://issues.apache.org/jira/browse/SPARK-15796).
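For reference, the suggested values could be set together in spark-defaults.conf (or as Zeppelin spark2 interpreter properties); this is just a sketch collecting the advice above:

```
spark.driver.memory     4g
spark.driver.cores      2
spark.memory.fraction   0.6
```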

Check the logs on the Zeppelin host:

tail -f /var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-{HOSTNAME}.log