Error in Spark Application - Missing an output location for shuffle 2

Rising Star

I am trying to run a Spark application that reads data from Hive tables into DataFrames and joins them. When I run the DataFrame operations individually in spark-shell, all the joins work fine and I am able to persist the data in ORC format in HDFS.

But when I run it as an application using spark-submit, I get the error below.

Missing an output location for shuffle 2

I researched this and found it to be related to a memory issue. I do not understand why the error does not occur in spark-shell with the same configuration, where I am able to persist everything.

The command I am using to run the application is below:

spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 10g --executor-cores 2 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar /home/talend/test_2.11-0.0.1.jar

My cluster configuration is:

2 master nodes, 3 slave nodes (4 cores and 28 GB each), and 1 edge node.

The Hive tables I am reading from are around 150 MB in size, which is very small compared to the memory I am giving to the Spark program.

In between, the application calls the following DataFrame functions: saveAsTable(), write.format(), and persist().

Any suggestions would be really helpful.


Super Collaborator

Hi @rahul gulati,

Apparently, the number of partitions of your DataFrame / RDD is causing the issue.

This can be controlled by adjusting the spark.default.parallelism parameter in the Spark context or by calling .repartition(<desired number>).

When you run in spark-shell, please check the mode and the number of cores allocated for the execution, and adjust the value to whichever one works in shell mode.

Alternatively, you can observe the same from the Spark UI and come to a conclusion on partitions.
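A minimal sketch of the two options above, assuming the Spark 1.x API (SparkContext plus HiveContext) that the datanucleus jars in the spark-submit command suggest. The object name, the table names, the join column `id`, and the partition count 6 are hypothetical placeholders, not values from the question. One addition that is not part of the answer: DataFrame joins take their shuffle partition count from spark.sql.shuffle.partitions, so the sketch sets that as well.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // 6 = 3 executors x 2 cores, taken from the spark-submit command in the
    // question. It is an illustrative starting point, not a recommendation.
    val numPartitions = 6

    val conf = new SparkConf()
      .setAppName("RepartitionSketch")
      // Applies to RDD shuffles and to parallelize() without a parent RDD.
      .set("spark.default.parallelism", numPartitions.toString)
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Assumption: DataFrame joins read their shuffle partition count from
    // this SQL setting rather than from spark.default.parallelism.
    hiveContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)

    // Hypothetical Hive tables and join column.
    val left = hiveContext.table("db.table_a")
    val right = hiveContext.table("db.table_b")

    // Explicitly repartition the joined result before persisting it.
    val joined = left.join(right, "id").repartition(numPartitions)
    println(s"partitions after repartition: ${joined.rdd.partitions.length}")

    sc.stop()
  }
}
```

Comparing the printed partition count with what spark-shell reports for the same DataFrame is one way to check whether the two runs really use the same settings.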

# From the Spark website, on spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

  • Local mode: number of cores on the local machine
  • Others: total number of cores on all executor nodes or 2, whichever is larger
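
As a rough illustration of the rule quoted above, the following sketch works out the default for the setup in the question (3 executors with 2 cores each, from the spark-submit command). The object and function names are hypothetical, and it only mirrors the documented rule for operations with no parent RDD; it does not call Spark itself.

```scala
object DefaultParallelism {
  // Local mode: number of cores on the local machine.
  def forLocalMode(localCores: Int): Int = localCores

  // Other cluster managers: total executor cores or 2, whichever is larger.
  def forCluster(numExecutors: Int, coresPerExecutor: Int): Int =
    math.max(numExecutors * coresPerExecutor, 2)

  def main(args: Array[String]): Unit = {
    // --num-executors 3 --executor-cores 2 from the question.
    println(forCluster(numExecutors = 3, coresPerExecutor = 2)) // prints 6
  }
}
```

So with the submitted settings the default would be 6, whereas a spark-shell session started with different core settings could end up with a different value.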