08-11-2017 10:15 AM
I have been trying to submit the Spark job below in cluster mode from a bash shell.
Submitting in client mode works perfectly fine, but when I switch to cluster mode the job fails with an error saying that no app file is present.
The app file in question is application.conf, which is reported as missing.
--master yarn \
--deploy-mode cluster \
--class myCLASS \
--properties-file /home/abhig/spark.conf \
--files /home/abhig/application.conf \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=application.conf -Dlog4j.configuration=/home/abhig/log4.properties" \
--driver-java-options "-Dconfig.file=/home/abhig/application.conf -Dlog4j.configuration=/home/abhig/log4.properties" \
I followed the link below on a similar post.
The solution mentioned there is still not clear to me.
I even tried
Still it doesn't work.
Any help will be appreciated.
08-13-2017 03:54 AM - edited 08-13-2017 04:04 AM
1) In cluster mode, you should use "--conf spark.driver.extraJavaOptions=" instead of "--driver-java-options".
2) You only list application.conf in --files; log4.properties is not there. So either log4.properties must already be distributed to each YARN node, or you should add it to the --files list and reference it with "-Dlog4j.configuration=./log4.properties".
For cluster mode, the full command should look like the following:
spark-submit \
--master yarn \
--deploy-mode cluster \
--class myCLASS \
--properties-file /home/abhig/spark.conf \
--files /home/abhig/application.conf,/home/abhig/log4.properties \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=application.conf -Dlog4j.configuration=./log4.properties" \
--conf spark.driver.extraJavaOptions="-Dconfig.file=./application.conf -Dlog4j.configuration=./log4.properties" \
/local/project/gateway/mypgm.jar
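Since every path given to --files has to exist on the machine you submit from, a small pre-flight check can catch a mistyped path before the job reaches YARN. This is only a sketch: check_files is a hypothetical helper, not something Spark provides, and the paths are the ones from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): verify that each local file
# exists and is readable before handing it to spark-submit via --files.
check_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Paths taken from this thread; adjust to your environment.
if check_files /home/abhig/application.conf /home/abhig/log4.properties; then
  echo "all --files inputs found"
else
  echo "fix the paths above before running spark-submit" >&2
fi
```

The function returns non-zero if any file is missing, so it can also guard the submit directly, e.g. check_files ... && spark-submit ...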
08-14-2017 08:00 AM
Thanks @Yuexin Zhang for the response.
I figured out the solution for this.
Below is the actual submit which worked for me.
The catch is that when we submit in cluster mode, Spark uploads the file to a staging directory on HDFS.
The path and name of the file on HDFS are then different from what the program expects.
To make the file available to the program, you have to give it an alias with '#', as shown below (that's the only trick).
Then, everywhere you need to refer to that file in the spark-submit command, just use the alias.
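To illustrate the path#alias form described above, here is a minimal sketch that builds the --files value from path/alias pairs. build_files_arg is a hypothetical helper I made up for illustration; only the resulting path#alias,path#alias string is what spark-submit sees.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join "local-path alias" pairs into the
# comma-separated path#alias form accepted by --files.
build_files_arg() {
  local out="" path name
  while [ "$#" -ge 2 ]; do
    path=$1
    name=$2
    shift 2
    # Prefix a comma only when something has already been added.
    out="${out:+$out,}${path}#${name}"
  done
  printf '%s\n' "$out"
}

# Paths and aliases taken from this thread.
FILES=$(build_files_arg \
  /home/abhig/application.conf application.conf \
  /home/abhig/log4.properties log4j)
echo "$FILES"
```

The value in $FILES can then be passed as --files "$FILES", and the -D options refer to the aliases (application.conf, log4j) rather than to the local paths.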
The links I referred to below give the complete walkthrough and explain how to reach the solution.
Issue also discussed here - https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-File-not-found-error-works-f... - (It didn't actually help me resolve this, so I posted separately)
Section "Important notes" in http://spark.apache.org/docs/latest/running-on-yarn.html (you kind of have to read between the lines)
Blog explaining the reason - http://progexc.blogspot.com/2014/12/spark-configuration-mess-solved.html (Nice blog :) )
spark-submit \
--master yarn \
--deploy-mode cluster \
--class myCLASS \
--properties-file /home/abhig/spark.conf \
--files /home/abhig/application.conf#application.conf,/home/abhig/log4.properties#log4j \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=application.conf -Dlog4j.configuration=log4j" \
--conf spark.driver.extraJavaOptions="-Dconfig.file=application.conf -Dlog4j.configuration=log4j" \
/local/project/gateway/mypgm.jar
Hope this helps the next person facing a similar issue!