How can I add configuration files to a Spark job running in YARN?

After reading the Spark documentation and source code, I've found two ways to reference an external configuration file inside a Spark (v1.4.1) job, but I'm unable to get either of them to work.

Method 1: the Spark documentation says to use ./bin/spark-submit --files /tmp/test_file.txt, but it doesn't specify how to retrieve that file inside a Spark job written in Java. I can see the file being uploaded, but I can't find any configuration parameter in Java that points to its destination directory:

INFO Client: Uploading resource file:/tmp/test_file.txt -> hdfs://sandbox.hortonworks.com:8020/user/guest/.sparkStaging/application_1452310382039_0019/test_file.txt
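
What I expected to work (a sketch, assuming the file keeps its base name test_file.txt after upload, and rdd is any existing JavaRDD) was something like:

import org.apache.spark.SparkFiles;

// Inside a task: files shipped with --files should be localized on each
// executor and resolvable by base name via SparkFiles.get
rdd.foreachPartition(it -> {
    String localPath = SparkFiles.get("test_file.txt"); // local path on the executor
    java.util.List<String> lines =
        java.nio.file.Files.readAllLines(java.nio.file.Paths.get(localPath));
    // ... parse the configuration ...
});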

Method 2: the Spark source code suggests using SparkContext.addFile(...) followed by SparkContext.textFile(SparkFiles.get(...)), but that doesn't work either: SparkFiles.get(...) returns a local path, and that directory doesn't exist in HDFS (only locally). I see this in the output of spark-submit --master yarn-client:

16/01/09 07:10:09 INFO Utils: Copying /tmp/test_file.txt to /tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
16/01/09 07:10:09 INFO SparkContext: Added file /tmp/test_file.txt at http://192.168.1.13:39397/files/test_file.txt with timestamp 1452323409690
...
16/01/09 07:10:17 INFO SparkContext: Created broadcast 5 from textFile at Main.java:72
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
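
Note that the copy shown in the log is a plain local file, so reading it on the driver with ordinary Java I/O does work; it's only sc.textFile that fails, because it resolves the unqualified path against fs.defaultFS (HDFS). A driver-side sketch of what I can get working (my own observation, not from the docs):

sc.addFile("/tmp/test_file.txt");
// SparkFiles.get returns the *local* driver-side copy, e.g.
// /tmp/spark-.../userFiles-.../test_file.txt
String localPath = org.apache.spark.SparkFiles.get("test_file.txt");
java.util.List<String> lines =
    java.nio.file.Files.readAllLines(java.nio.file.Paths.get(localPath)); // throws IOException
JavaRDD<String> config = sc.parallelize(lines); // only if an RDD is actually needed

That only helps on the driver, though, which is why I'm asking how to do this properly under YARN.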
1 ACCEPTED SOLUTION

Contributor

If you add your external files using spark-submit --files, they are uploaded to the application's HDFS staging folder, e.g. hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508

(application_1449220589084_0508 is an example of a YARN application ID.)

In your Spark application, you can find your files in two ways (see the sketch after this list):

1. Get the Spark staging directory from the environment (you still need to prepend the HDFS URI and your user's home directory):

System.getenv("SPARK_YARN_STAGING_DIR"); // returns .sparkStaging/application_1449220589084_0508

2. Get the complete, comma-separated file paths from:

System.getenv("SPARK_YARN_CACHE_FILES"); // returns hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt


I ended up storing the file in HDFS and reading it through sc.textFile(args[0]).
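
For completeness, a sketch of that workaround (paths are illustrative, reusing the examples above):

// Put the file in HDFS once, then pass its path as the first application argument:
//   hdfs dfs -put /tmp/test_file.txt /user/guest/test_file.txt
//   spark-submit --master yarn-client --class Main your-spark-job.jar \
//       hdfs://sandbox.hortonworks.com:8020/user/guest/test_file.txt
JavaRDD<String> config = sc.textFile(args[0]);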