Created 04-13-2018 11:01 AM
We are on HDP 2.6.3 with Spark 2.2, running the job in YARN cluster mode using spark-submit.
Our spark-env.sh contains SPARK_YARN_DIST_FILES="/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml", but these values are not honored.
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
The job tries to connect to Hive and fetch data from a table, but it fails with a "table or view not found in database" error:
 diagnostics: User class threw exception: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'xyz' not found in database 'qwerty';
         ApplicationMaster host: 121.121.121.121
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1523616607943
         final status: FAILED
         tracking URL: https://managenode002xxserver:8090/proxy/application_1523374609937_10224/
         user: abc123
Exception in thread "main" org.apache.spark.SparkException: Application application_1523374609937_10224 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The same job works when we pass the --files parameter:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
Result attached.
Any pointers on why it is not picking up SPARK_YARN_DIST_FILES?
Thanks
Venkat
Created 04-13-2018 11:06 AM
You can use the --files parameter when deploying applications on YARN, like this:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
It worked in my case.
Created 04-13-2018 11:37 AM
As I mentioned, --files does work. But when the files are listed in SPARK_YARN_DIST_FILES and are also available under /etc/spark2/conf (for example /etc/spark2/conf/hive-site.xml), Spark should be able to pick them up. Is there any specific reason they are not being picked up?
Thanks
Venkat
Created 04-17-2018 02:21 PM
Can you please share the Spark documentation that refers to "SPARK_YARN_DIST_FILES"?
In the Spark 2.2 code, I couldn't locate any usage of this environment variable.
Created 04-17-2018 02:39 PM
I haven't come across any documentation, but in the HDP installation you can find it in /etc/spark2/conf/spark-env.sh:
# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.
But this says these options are read only in YARN client mode.
And the job is not picking up the files available in /etc/spark2/conf either.
Thanks
Venkat
Created 04-17-2018 02:34 PM
@Venkata Sudheer Kumar M, I'm not sure whether SPARK_YARN_DIST_FILES is a valid spark-env value, but you can pass comma-separated files using the spark.yarn.dist.files Spark property.
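As a rough sketch of that suggestion, the snippet below builds the comma-separated value for spark.yarn.dist.files from whichever of the two files exist on the submitting machine, then prints the resulting spark-submit command instead of running it. The helper name build_dist_files is hypothetical and only for illustration; the class name, jar path, and file paths are the ones used earlier in this thread.

```shell
# Hypothetical helper (not part of Spark): joins the paths that exist
# as regular files into a single comma-separated list.
build_dist_files() {
    list=""
    for f in "$@"; do
        if [ -f "$f" ]; then
            if [ -z "$list" ]; then
                list="$f"
            else
                list="$list,$f"
            fi
        fi
    done
    printf '%s' "$list"
}

DIST_FILES=$(build_dist_files /etc/spark2/conf/hive-site.xml /etc/spark2/conf/hbase-site.xml)

# Print the command for review rather than executing it.
echo spark-submit --class com.virtuslab.sparksql.MainClass \
    --master yarn --deploy-mode cluster \
    --conf "spark.yarn.dist.files=$DIST_FILES" \
    /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
```

If neither file exists the property value comes out empty, so it is worth checking the printed command before running it.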
Created 04-17-2018 02:55 PM
--files and --conf spark.yarn.dist.files both work. Is there any specific reason we have to pass these parameters even though hive-site.xml and hbase-site.xml are present in /etc/spark2/conf?
Thanks
Venkat
Created 04-19-2018 08:50 PM
A couple of things to note:
1. If the hive-site.xml file was manually copied to the spark2/conf folder, any Spark configuration change made from Ambari might have removed it.
2. As the deploy mode is cluster, you need to check whether hive-site.xml and hbase-site.xml are available under the Spark conf directory on the driver machine, rather than on the machine where the spark-submit command was executed.
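One way to do the check in point 2 is a small script run on each node that could host the driver. This is only a sketch: the helper name missing_conf_files is hypothetical, and /etc/spark2/conf is assumed to be the Spark conf directory on every node, as it is in the paths quoted in this thread.

```shell
# Hypothetical helper (not part of Spark or HDP): prints the name of each
# expected file that is missing from the given directory, one per line.
missing_conf_files() {
    conf_dir="$1"
    shift
    for name in "$@"; do
        [ -f "$conf_dir/$name" ] || echo "$name"
    done
}

missing=$(missing_conf_files /etc/spark2/conf hive-site.xml hbase-site.xml)

if [ -n "$missing" ]; then
    echo "Missing on $(hostname):"
    echo "$missing"
else
    echo "All expected files present on $(hostname)"
fi
```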
Created 04-20-2018 11:28 AM
This has been identified as a bug in Spark 2.2, which is fixed in Spark 2.3.