<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How can I add configuration files to a Spark job running in YARN? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101951#M64911</link>
    <description>&lt;P&gt;If you add external files with "spark-submit --files", they are uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508 (where application_1449220589084_0508 is an example YARN application ID).&lt;/P&gt;&lt;P&gt;Inside your Spark application, you can locate the files in two ways:&lt;/P&gt;&lt;P&gt;1. Find the Spark staging directory with the code below (note that the result is a relative path, so you still need the HDFS URI and your username):&lt;/P&gt;&lt;PRE&gt;System.getenv("SPARK_YARN_STAGING_DIR"); --&amp;gt; .sparkStaging/application_1449220589084_0508&lt;/PRE&gt;&lt;P&gt;2. Find the complete, comma-separated file paths with:&lt;/P&gt;&lt;PRE&gt;System.getenv("SPARK_YARN_CACHE_FILES"); --&amp;gt; hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt&lt;/PRE&gt;</description>
    <pubDate>Tue, 12 Jan 2016 01:07:03 GMT</pubDate>
    <dc:creator>mahan</dc:creator>
    <dc:date>2016-01-12T01:07:03Z</dc:date>
    <item>
      <title>How can I add configuration files to a Spark job running in YARN?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101950#M64910</link>
      <description>&lt;P&gt;After reading the Spark documentation and source code, I found two ways to reference an external configuration file inside a Spark (v1.4.1) job, but I'm unable to get either of them to work.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Method 1&lt;/STRONG&gt;: the &lt;A href="http://spark.apache.org/docs/latest/submitting-applications.html"&gt;Spark documentation&lt;/A&gt; says to use &lt;EM&gt;./bin/spark-submit --files /tmp/test_file.txt&lt;/EM&gt;, but doesn't specify how to retrieve that file inside a Spark job written in Java. I can see the file being uploaded, but I don't see any configuration parameter in Java that points me to the destination directory:&lt;/P&gt;&lt;PRE&gt;INFO Client: Uploading resource file:/tmp/test_file.txt -&amp;gt; hdfs://sandbox.hortonworks.com:8020/user/guest/.sparkStaging/application_1452310382039_0019/test_file.txt&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Method 2&lt;/STRONG&gt;: the &lt;A href="https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L1330"&gt;Spark source code&lt;/A&gt; suggests using &lt;EM&gt;SparkContext.addFile(...)&lt;/EM&gt; and &lt;EM&gt;SparkContext.textFile(SparkFiles.get(...))&lt;/EM&gt;, but that doesn't work either, because the returned directory exists only locally, not in HDFS. I see this in the output of &lt;EM&gt;spark-submit --master yarn-client&lt;/EM&gt;:&lt;/P&gt;&lt;PRE&gt;16/01/09 07:10:09 INFO Utils: Copying /tmp/test_file.txt to /tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
16/01/09 07:10:09 INFO SparkContext: Added file /tmp/test_file.txt at http://192.168.1.13:39397/files/test_file.txt with timestamp 1452323409690
.
.
16/01/09 07:10:17 INFO SparkContext: Created broadcast 5 from textFile at Main.java:72
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt&lt;/PRE&gt;</description>
      <pubDate>Sat, 09 Jan 2016 15:32:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101950#M64910</guid>
      <dc:creator>vzlatkin</dc:creator>
      <dc:date>2016-01-09T15:32:43Z</dc:date>
    </item>
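The question above reduces to locating a file shipped with "spark-submit --files" at run time. On YARN, such files are linked into each container's working directory under their base name, so inside an executor (and inside the driver in yarn-cluster mode) they can be opened by relative path with plain Java I/O. A minimal sketch under that assumption; the class name ReadShippedFile and its helper are illustrative, not from the thread:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadShippedFile {
    // Resolve against the current working directory, which on YARN is the
    // container's working directory where --files entries are linked.
    static String readRelative(String baseName) throws IOException {
        Path p = Paths.get(baseName);
        byte[] bytes = Files.readAllBytes(p);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "test_file.txt" is the file name from the question.
        System.out.println(readRelative("test_file.txt"));
    }
}
```

Note that in yarn-client mode the driver runs on the submitting machine, not in a container, so the relative-path approach only applies inside tasks there.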
    <item>
      <title>Re: How can I add configuration files to a Spark job running in YARN?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101951#M64911</link>
      <description>&lt;P&gt;If you add external files with "spark-submit --files", they are uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508 (where application_1449220589084_0508 is an example YARN application ID).&lt;/P&gt;&lt;P&gt;Inside your Spark application, you can locate the files in two ways:&lt;/P&gt;&lt;P&gt;1. Find the Spark staging directory with the code below (note that the result is a relative path, so you still need the HDFS URI and your username):&lt;/P&gt;&lt;PRE&gt;System.getenv("SPARK_YARN_STAGING_DIR"); --&amp;gt; .sparkStaging/application_1449220589084_0508&lt;/PRE&gt;&lt;P&gt;2. Find the complete, comma-separated file paths with:&lt;/P&gt;&lt;PRE&gt;System.getenv("SPARK_YARN_CACHE_FILES"); --&amp;gt; hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt&lt;/PRE&gt;</description>
      <pubDate>Tue, 12 Jan 2016 01:07:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101951#M64911</guid>
      <dc:creator>mahan</dc:creator>
      <dc:date>2016-01-12T01:07:03Z</dc:date>
    </item>
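The SPARK_YARN_CACHE_FILES value shown in the reply above is a comma-separated list of uri#alias entries, where the alias after the '#' is the link name visible in the container's working directory. A small sketch of splitting such a value into (alias, URI) pairs; the parse helper and the class name CacheFilesParser are illustrative, not part of any Spark API:

```java
public class CacheFilesParser {
    // Split a SPARK_YARN_CACHE_FILES-style value into (alias, URI) pairs.
    // Each entry has the form hdfs://...#alias.
    static String[][] parse(String cacheFiles) {
        String[] entries = cacheFiles.split(",");
        String[][] pairs = new String[entries.length][2];
        int i = 0;
        for (String entry : entries) {
            int hash = entry.lastIndexOf('#');
            pairs[i][0] = entry.substring(hash + 1);  // alias, e.g. test_file.txt
            pairs[i][1] = entry.substring(0, hash);   // full HDFS path
            i++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Inside a YARN container the variable is set by Spark; elsewhere it
        // is typically absent, so guard against null.
        String value = System.getenv("SPARK_YARN_CACHE_FILES");
        if (value != null) {
            for (String[] pair : parse(value)) {
                System.out.println(pair[0] + " from " + pair[1]);
            }
        }
    }
}
```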
    <item>
      <title>Re: How can I add configuration files to a Spark job running in YARN?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101952#M64912</link>
      <description>&lt;P&gt;I ended up storing the file in HDFS and reading it through sc.textFile(args[0]).&lt;/P&gt;</description>
      <pubDate>Wed, 25 May 2016 06:21:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-add-configuration-files-to-a-Spark-job-running-in/m-p/101952#M64912</guid>
      <dc:creator>vzlatkin</dc:creator>
      <dc:date>2016-05-25T06:21:22Z</dc:date>
    </item>
  </channel>
</rss>