Error in Spark Application - Missing an output location for shuffle 2
Labels: Apache Spark
Created 06-07-2017 10:28 AM
I am trying to run a Spark application that reads data from Hive tables into DataFrames and joins them. When I run the DataFrame operations individually in spark-shell, all the joins work fine and I am able to persist the data in ORC format in HDFS.
But when I run it as an application using spark-submit, I get the error mentioned below:
Missing an output location for shuffle 2
I did some research on this and found it to be related to a memory issue. What I don't understand is why this error does not occur in spark-shell with the same configuration, where I am able to persist everything.
The command I am using to run the application is mentioned below:
spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 10g --executor-cores 2 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar /home/talend/test_2.11-0.0.1.jar
My cluster configuration is:
2 master nodes, 3 slave nodes (4 cores and 28 GB each), and 1 edge node.
The Hive tables I am reading from are only around 150 MB in size, which is very small compared to the memory I am giving the Spark program.
In between, I am calling the DataFrame functions saveAsTable(), write.format(), and persist() in the application.
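For reference, the flow is roughly like this (a simplified sketch with placeholder table and column names, assuming a Spark 1.6-style HiveContext):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Cences {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Cences"))
    val hiveContext = new HiveContext(sc)

    // Placeholder table/column names -- the real application reads several Hive tables
    val orders    = hiveContext.table("mydb.orders")
    val customers = hiveContext.table("mydb.customers")

    // The join is a shuffle stage; this is where
    // "Missing an output location for shuffle N" shows up when shuffle output is lost
    val joined = orders.join(customers, Seq("customer_id"))

    joined.persist()
    joined.write.format("orc").saveAsTable("mydb.orders_enriched")
  }
}
```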
Any suggestions would be really helpful.
Created 06-14-2017 07:08 AM
Hi @rahul gulati,
Apparently, the number of partitions for your DataFrame / RDD is causing the issue.
This can be controlled by adjusting the spark.default.parallelism parameter in the Spark context, or by using .repartition(<desired number>).
When you run in spark-shell, please check the mode and the number of cores allocated for the execution, and adjust the value to whatever works for shell mode.
Alternatively, you can observe the same from the Spark UI and come to a conclusion about the partitions.
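For example, something along these lines (a rough sketch assuming a HiveContext; the value 12, the object name, and the table/column names are only placeholders -- note that for DataFrame joins the shuffle partition count is governed by spark.sql.shuffle.partitions, so it is worth setting that as well):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PartitionTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Cences")
      .set("spark.default.parallelism", "12")    // parallelism for RDD shuffle operations
      .set("spark.sql.shuffle.partitions", "12") // partition count for DataFrame joins/aggregations

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Alternatively, repartition a specific DataFrame before an expensive join
    val joined = hiveContext.table("mydb.orders")
      .repartition(12)
      .join(hiveContext.table("mydb.customers"), Seq("customer_id"))

    joined.count() // force evaluation, just for the example
  }
}
```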
# From the Spark documentation on spark.default.parallelism:
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Others: total number of cores on all executor nodes or 2, whichever is larger
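Applying that rule to your spark-submit command above (--num-executors 3 --executor-cores 2), the default would work out to 3 x 2 = 6, which is why explicitly raising spark.default.parallelism or repartitioning can make a noticeable difference here.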
