I am trying to run a Spark application that reads data from Hive tables into DataFrames and joins them. When I run the DataFrame operations individually in the Spark shell, all the joins work fine and I am able to persist the data in ORC format in HDFS.
But when I run it as an application using spark-submit, I get the error below.
Missing an output location for shuffle 2
I did some research on this and found that it is related to a memory issue. I do not understand why this error does not occur in the Spark shell, where I use the same configuration and am able to persist everything.
The command I am using to run the application is:
spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 10g --executor-cores 2 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar /home/talend/test_2.11-0.0.1.jar
My cluster configuration is:
2 master nodes, 3 slave nodes (4 cores and 28 GB each) and 1 edge node.
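For reference, here is a rough calculation of what the settings above ask YARN for. It is only a sketch: it assumes the overhead rule max(384 MB, 10% of executor memory), which depends on the Spark version and on whether spark.yarn.executor.memoryOverhead is set, and it does not know how much memory YARN is allowed to allocate per node (yarn.nodemanager.resource.memory-mb), which may be well below the physical 28 GB. The object and function names are made up for illustration.

```scala
object ExecutorMemorySketch {
  // Assumed overhead rule: max(minOverheadMb, overheadFactor * executor memory).
  // The factor and minimum are assumptions; check the defaults for your Spark version.
  def containerMb(executorMemoryMb: Int,
                  overheadFactor: Double = 0.10,
                  minOverheadMb: Int = 384): Int = {
    val overheadMb = math.max(minOverheadMb, (executorMemoryMb * overheadFactor).toInt)
    executorMemoryMb + overheadMb
  }

  def main(args: Array[String]): Unit = {
    val executorMemoryMb = 10 * 1024 // --executor-memory 10g
    val numExecutors = 3             // --num-executors 3
    val nodeMemoryMb = 28 * 1024     // physical memory per slave node

    val perContainer = containerMb(executorMemoryMb)
    println(s"Container request per executor: $perContainer MB")
    println(s"Total across $numExecutors executors: ${perContainer * numExecutors} MB")
    println(s"Physical memory per slave node: $nodeMemoryMb MB")
  }
}
```

In yarn-client mode the driver runs on the machine where spark-submit is launched, so the 10g of driver memory is not part of these container requests.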
The Hive tables I am reading from are around 150 MB in size, which is very small compared to the memory I am giving to the Spark program.
In the application I call the following DataFrame functions along the way: saveAsTable(), write.format() and persist().
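A minimal sketch of the shape of the application is below. It is not the actual code: the table names, join key, output path and object name are placeholders, and it assumes the Spark 1.x HiveContext API (Spark 1.4 or later for DataFrame.write).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel

object CencesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CencesSketch"))
    val hiveContext = new HiveContext(sc)

    // Placeholder Hive tables; the real application reads several tables of about 150 MB.
    val tableA = hiveContext.table("table_a")
    val tableB = hiveContext.table("table_b")

    // Join on a placeholder key column and keep the result cached for reuse.
    val joined = tableA.join(tableB, "id").persist(StorageLevel.MEMORY_AND_DISK)

    // Write the joined data to HDFS in ORC format (placeholder path).
    joined.write.format("orc").mode(SaveMode.Overwrite).save("/tmp/cences_sketch/joined_orc")

    // Also save the result as a Hive table (placeholder name).
    joined.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("joined_result")

    joined.unpersist()
    sc.stop()
  }
}
```

The join is the step that triggers a shuffle, so this is where I would expect the shuffle-related error to originate.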
Any suggestions would be really helpful.