Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

How can I add configuration files to a Spark job running in YARN?


After reading the Spark documentation and source code, I have found two ways to reference an external configuration file inside a Spark (v1.4.1) job, but I am unable to get either of them to work.

Method 1: the Spark documentation says to use ./bin/spark-submit --files /tmp/test_file.txt, but it doesn't specify how to retrieve that file inside a Spark job written in Java. I can see the file being added, but I don't see any configuration parameter in Java that points me to the destination directory:

INFO Client: Uploading resource file:/tmp/test_file.txt -> hdfs://sandbox.hortonworks.com:8020/user/guest/.sparkStaging/application_1452310382039_0019/test_file.txt

Method 2: the Spark source code suggests using SparkContext.addFile(...) and then SparkContext.textFile(SparkFiles.get(...)), but that doesn't work either, because that directory exists only locally, not in HDFS. I see this in the output of spark-submit --master yarn-client:

16/01/09 07:10:09 INFO Utils: Copying /tmp/test_file.txt to /tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
16/01/09 07:10:09 INFO SparkContext: Added file /tmp/test_file.txt at http://192.168.1.13:39397/files/test_file.txt with timestamp 1452323409690
...
16/01/09 07:10:17 INFO SparkContext: Created broadcast 5 from textFile at Main.java:72
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
1 ACCEPTED SOLUTION


If you add your external files using "spark-submit --files", your files will be uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508

Here, application_1449220589084_0508 is an example YARN application ID.
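Putting those pieces together, the staging path of a --files upload can be assembled from the filesystem URI, the user name, and the application ID. A minimal sketch (the helper name and all values are hard-coded for illustration; in a real job the filesystem URI would come from the Hadoop configuration and the rest from the environment):

```java
public class StagingPath {
    /** Builds the HDFS path where a "spark-submit --files" upload lands. */
    static String stagingFilePath(String fsUri, String user, String appId, String fileName) {
        return fsUri + "/user/" + user + "/.sparkStaging/" + appId + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(stagingFilePath(
                "hdfs://your-cluster", "your-user",
                "application_1449220589084_0508", "test_file.txt"));
        // -> hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508/test_file.txt
    }
}
```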

In your Spark application, you can find your files in two ways:

1. Find the Spark staging directory with the code below (note that you also need the HDFS URI and your user name to build the full path):

System.getenv("SPARK_YARN_STAGING_DIR"); --> .sparkStaging/application_1449220589084_0508

2. Find the complete, comma-separated file paths with:

System.getenv("SPARK_YARN_CACHE_FILES"); --> hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
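The value of SPARK_YARN_CACHE_FILES is a comma-separated list of uri#linkName entries, so the HDFS path of a specific file can be picked out with plain string handling. A minimal sketch (the helper name is mine, and the environment value is hard-coded here from the example above; in a real job it would come from System.getenv):

```java
import java.util.Arrays;
import java.util.Optional;

public class CacheFiles {
    /** Returns the URI whose "#linkName" fragment matches the given file name, if any. */
    static Optional<String> findCachedFile(String cacheFiles, String linkName) {
        return Arrays.stream(cacheFiles.split(","))
                .filter(entry -> entry.endsWith("#" + linkName))
                .map(entry -> entry.substring(0, entry.lastIndexOf('#')))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hard-coded sample; normally: String cacheFiles = System.getenv("SPARK_YARN_CACHE_FILES");
        String cacheFiles =
            "hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,"
          + "hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt";

        System.out.println(findCachedFile(cacheFiles, "test_file.txt").orElse("not found"));
        // -> hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt
    }
}
```

The resulting HDFS URI can then be passed straight to SparkContext.textFile.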


2 REPLIES



I ended up storing the file in HDFS myself, passing its path as a program argument, and reading it with sc.textFile(args[0]).