We are on HDP 2.6.3 with Spark 2.2, running the job in YARN cluster mode.
We submit with spark-submit, and spark-env.sh contains SPARK_YARN_DIST_FILES="/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml", but these values are not honored.
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
The job tries to connect to Hive and fetch data from a table, but it fails with a table-not-found error:
diagnostics: User class threw exception: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'xyz' not found in database 'qwerty';
ApplicationMaster host: 188.8.131.52
ApplicationMaster RPC port: 0
queue: default
start time: 1523616607943
final status: FAILED
tracking URL: https://managenode002xxserver:8090/proxy/application_1523374609937_10224/
user: abc123
Exception in thread "main" org.apache.spark.SparkException: Application application_1523374609937_10224 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The same works when we pass the --files parameter:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
Any pointers on why it is not picking up SPARK_YARN_DIST_FILES?
You can use the --files parameter while deploying applications on YARN, like:
spark-submit --class com.virtuslab.sparksql.MainClass --master yarn --deploy-mode cluster --files /etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
It worked in my case.
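As the follow-up comments below note, --conf spark.yarn.dist.files also works; the same distribution as --files can be expressed through that property. A sketch, reusing the class name, jar path, and file locations from the question:

```shell
# Equivalent to --files: distribute the config files to the YARN
# containers via the spark.yarn.dist.files property instead.
spark-submit \
  --class com.virtuslab.sparksql.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml,/etc/spark2/conf/hbase-site.xml \
  /tmp/spark-hive-test/spark_sql_under_the_hood-spark2.2.0.jar
```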
As I said, --files does work. But when the files are listed in SPARK_YARN_DIST_FILES, and they are also available at /etc/spark2/conf/hive-site.xml, Spark should be able to pick them up. Is there a specific reason it does not?
I haven't come across any document, but on an HDP installation you can find it in /etc/spark2/conf/spark-env.sh:
# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.
But this section says it applies only to YARN client mode.
And the job is not picking up the files available in /etc/spark2/conf either.
Both --files and --conf spark.yarn.dist.files work. Is there a specific reason we have to pass these parameters even though hive-site.xml and hbase-site.xml are present in /etc/spark2/conf?
A couple of things to note:
1. If hive-site.xml was manually copied to the spark2/conf folder, any Spark configuration change pushed from Ambari might have removed it.
2. Since the deploy mode is cluster, you need to check whether hive-site.xml and hbase-site.xml are available under the Spark conf directory on the machine where the driver runs, not on the machine where the spark-submit command was executed.
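The second point can be checked quickly with a small shell loop, run on each node where the driver might be placed (a hedged sketch; /etc/spark2/conf is the HDP default from the question and may differ on your cluster):

```shell
# Verify the XML configs are readable where the driver may run.
# /etc/spark2/conf is the HDP default from the question; override via CONF_DIR.
CONF_DIR="${CONF_DIR:-/etc/spark2/conf}"
for f in hive-site.xml hbase-site.xml; do
  if [ -r "$CONF_DIR/$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f"
  fi
done
```

If either file reports missing on the node that hosted the ApplicationMaster, that would explain why the job only works when the files are shipped explicitly with --files.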