We are on HDP 2.6.3 with Spark 2.2, running the job in YARN cluster mode.
We submit with spark-submit, and spark-env.sh contains SPARK_YARN_DIST_FILES="/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml", but these values are not honored.
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
The job tries to connect to Hive and fetch data from a table, but it fails with a table-not-found error:
diagnostics: User class threw exception: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'xyz' not found in database 'qwerty';
ApplicationMaster host: 188.8.131.52
ApplicationMaster RPC port: 0
queue: default
start time: 1523616607943
final status: FAILED
tracking URL: https://managenode002xxserver:8090/proxy/application_1523374609937_10224/
user: abc123
Exception in thread "main" org.apache.spark.SparkException: Application application_1523374609937_10224 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The same works when we pass the --files parameter:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
Any pointers on why it is not picking up SPARK_YARN_DIST_FILES?
You can use the --files parameter while deploying applications on YARN, like:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
It worked in my case.
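As the follow-up comments below note, --conf spark.yarn.dist.files also works; the same distribution as --files can be expressed through that property. A sketch, reusing the class name, jar path, and file locations from the question:

```shell
# Equivalent to --files: distribute the config files to the YARN
# containers via the spark.yarn.dist.files property instead.
spark-submit \
  --class com.virtuslab.sparksql.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml \
  /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
```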
As I said, --files does work. But when the files are listed in SPARK_YARN_DIST_FILES, and they are also available at /etc/spark2/conf/hive-site.xml, Spark should be able to pick them up. Is there a specific reason it does not?
I haven't come across any document, but on an HDP installation you can find it in /etc/spark2/conf/spark-env.sh:
# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.
But this section says it applies only to YARN client mode.
And the job is not picking up the files available in /etc/spark2/conf either.
Both --files and --conf spark.yarn.dist.files work. Is there a specific reason we have to pass these parameters even though hive-site.xml and hbase-site.xml are present in /etc/spark2/conf?
A couple of things to note:
1. If hive-site.xml was manually copied to the spark2/conf folder, any Spark configuration change pushed from Ambari might have removed it.
2. Since the deploy mode is cluster, you need to check whether hive-site.xml and hbase-site.xml are available under the Spark conf directory on the machine where the driver runs, not on the machine where the spark-submit command was executed.
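The second point can be checked quickly with a small shell loop, run on each node where the driver might be placed (a hedged sketch; /etc/spark2/conf is the HDP default from the question and may differ on your cluster):

```shell
# Verify the XML configs are readable where the driver may run.
# /etc/spark2/conf is the HDP default from the question; override via CONF_DIR.
CONF_DIR="${CONF_DIR:-/etc/spark2/conf}"
for f in hive-site.xml hbase-site.xml; do
  if [ -r "$CONF_DIR/$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f"
  fi
done
```

If either file reports missing on the node that hosted the ApplicationMaster, that would explain why the job only works when the files are shipped explicitly with --files.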