
SPARK SUBMIT issue with SPARK 2.2

Expert Contributor

We are on HDP 2.6.3, using Spark 2.2, and running the job in YARN cluster mode.

We submit using spark-submit, and spark-env.sh contains SPARK_YARN_DIST_FILES="/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml", but these values are not honored:

spark-submit --class com.virtuslab.sparksql.MainClass  --master yarn --deploy-mode cluster /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar

The job tries to connect to Hive and fetch data from a table, but it fails because the table is not found in the database:

 diagnostics: User class threw exception: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'xyz' not found in database 'qwerty';
         ApplicationMaster host: 121.121.121.121
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1523616607943
         final status: FAILED
         tracking URL: https://managenode002xxserver:8090/proxy/application_1523374609937_10224/
         user: abc123

Exception in thread "main" org.apache.spark.SparkException: Application application_1523374609937_10224 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
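
In cluster mode the full driver-side log ends up in the YARN aggregated logs; it can be pulled with the application ID from the output above (assuming log aggregation is enabled on the cluster):

# Fetch the aggregated container logs for the failed run
yarn logs -applicationId application_1523374609937_10224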

The same job works when we pass the --files parameter:

spark-submit --class com.virtuslab.sparksql.MainClass  --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar

Result attached.

Any pointers on why it is not picking up SPARK_YARN_DIST_FILES?

Thanks

Venkat

8 REPLIES

Contributor
@Venkata Sudheer Kumar M

You can use the --files parameter while deploying applications on YARN, like:

spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar

It worked in my case.

Expert Contributor
@Rohit Khose

As you suggested, --files does work. But when the files are given as part of SPARK_YARN_DIST_FILES, and they are also available at /etc/spark2/conf/hive-site.xml, Spark should be able to pick them up. Is there any specific reason this is not happening?

Thanks

Venkat

Explorer

@Venkata Sudheer Kumar M

Can you please share the Spark documentation which refers to "SPARK_YARN_DIST_FILES"?

In the Spark 2.2 code, I couldn't locate any usage of this environment variable.
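
For example, grepping a checkout of the Spark 2.2.0 sources turns up nothing (a sketch, assuming a local clone of the Apache Spark repository):

# Shallow-clone the 2.2.0 release tag and search for the variable
git clone --branch v2.2.0 --depth 1 https://github.com/apache/spark.git
grep -rn "SPARK_YARN_DIST_FILES" spark/
# per the above, no usage is expected in the 2.2 code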

Expert Contributor
@Vinod K C

I haven't come across any documentation, but in an HDP installation you can find it in /etc/spark2/conf/spark-env.sh:

# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.

But this comment block refers only to YARN client mode.

And the job is not picking up the files available in /etc/spark2/conf either.

Thanks

Venkat

Contributor

@Venkata Sudheer Kumar M, I'm not sure if SPARK_YARN_DIST_FILES is a valid spark-env value, but you can pass a comma-separated list of files using the spark.yarn.dist.files Spark property.
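
For example, a sketch reusing the jar and config paths from the original question:

spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar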

Expert Contributor

@Kiran Nittala

Both --files and --conf spark.yarn.dist.files work. Is there any specific reason we have to pass these parameters even though hive-site.xml and hbase-site.xml are already present in /etc/spark2/conf?

Thanks

Venkat

Contributor
@Venkata Sudheer Kumar M

A couple of things to note:

1. If the hive-site.xml file was manually copied to the spark2/conf folder, any Spark configuration change from Ambari might have removed the hive-site.xml.

2. As the deploy mode is cluster, you need to check whether the hive-site.xml and hbase-site.xml files are available under the Spark conf directory on the driver machine, and not just on the machine where the spark-submit command was executed; a quick check is sketched below.
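
A quick way to verify is to list the files on every node that can host the driver (a sketch; the hostnames are illustrative, substitute your actual NodeManager hosts):

# Cluster mode launches the driver on an arbitrary NodeManager,
# so the configs must exist on all of them
for host in node001 node002 node003; do
  ssh "$host" 'ls -l /etc/spark2/conf/hive-site.xml /etc/spark2/conf/hbase-site.xml'
done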

Expert Contributor

This has been identified as a bug in Spark 2.2, which is fixed in Spark 2.3.