
How do I add the Hadoop and YARN configuration files to the Spark application classpath?

Rising Star

Hi All,

I am new to Spark. I am trying to submit a Spark application from a Java program, and I can already submit one to a Spark standalone cluster. What I actually want is to submit the job to a YARN cluster, and I am able to connect to the YARN cluster by explicitly adding the ResourceManager property to the Spark config, as below.

sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXX:8032");

But the application is failing with:

exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist

I found this in the ResourceManager log. It appears Spark is treating the file system as local and is not uploading the required libraries:

Source and destination file systems are the same. Not copying file:/tmp/spark-1ed67f05-d496-4000-86c1-07fcf8526181/__spark_libs__1740543841989079602.zip

This message came from the Spark application where I am running my program.

My suspicion is that Spark is assuming the file system is local rather than HDFS; correct me if I am wrong.

My questions are:

1. Given the log information above, what is actually causing the job to fail?

2. Could you please tell me how to add resource files to the Spark configuration, similar to addResource in the Hadoop Configuration?

Thanks in advance,

Param.

1 ACCEPTED SOLUTION

Rising Star

@Artem Ervits,

Thanks a lot for your time and help.

However, I was able to achieve my objective by setting the Hadoop and YARN properties in the Spark configuration:

sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXX"); sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXX:8032"); sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020"); sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/");

Regards,

Param.


9 REPLIES

Master Mentor

@Param NC, please take a look at our documentation: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/ch_developi...

For general knowledge, here is an example of doing it in YARN mode, from http://spark.apache.org/docs/1.6.2/submitting-applications.html. On an HDP distribution, HADOOP_CONF_DIR usually points to /etc/hadoop/conf; that directory contains core-site.xml, yarn-site.xml, hdfs-site.xml, etc. (Use --deploy-mode client instead of cluster for client mode.)

export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Master Mentor

@Param NC here's how I got it to work on my cluster

export HADOOP_CONF_DIR=/etc/hadoop/conf
/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --num-executors 3 \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar \
  100

Rising Star

@Artem Ervits

Thank you very much for the response.

I am able to submit the job to YARN through the spark-submit command, but what I am actually looking for is to do the same thing from a program. It would be great if you could provide a template for that, preferably in Java.

-Param.

Master Mentor

@Param NC you need to build your application with the hadoop-client dependency in your pom.xml or sbt build; for its scope, supply <scope>provided</scope>. See http://spark.apache.org/docs/1.6.2/submitting-applications.html

Bundling Your Application’s Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

More info here http://spark.apache.org/docs/1.6.2/running-on-yarn.html

Here's a sample pom.xml definition for hadoop-client

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1.2.3.0.0-2557</version>
            <scope>provided</scope>
            <type>jar</type>
        </dependency>
    </dependencies>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
    
    <repositories>
        <repository>
            <id>HDPReleases</id>
            <name>HDP Releases</name>
            <url>http://repo.hortonworks.com/content/repositories/public</url>
            <layout>default</layout>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
                <checksumPolicy>warn</checksumPolicy>
            </releases>
            <snapshots>
                <enabled>false</enabled>
                <updatePolicy>never</updatePolicy>
                <checksumPolicy>fail</checksumPolicy>
            </snapshots>
        </repository>
        <repository>
            <id>HDPJetty</id>
            <name>Hadoop Jetty</name>
            <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
            <layout>default</layout>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
                <checksumPolicy>warn</checksumPolicy>
            </releases>
            <snapshots>
                <enabled>false</enabled>
                <updatePolicy>never</updatePolicy>
                <checksumPolicy>fail</checksumPolicy>
            </snapshots>
        </repository>
        <repository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>central</id>
            <name>bintray</name>
            <url>http://jcenter.bintray.com</url>
        </repository>
    </repositories>

Rising Star

@Artem Ervits,

Thanks again! And sorry if I am asking too many questions here.

What I am actually looking for: per the project requirements I should not use the spark-submit script, so I am passing the cluster configuration through the Spark config, as given below.

SparkConf sparkConfig = new SparkConf().setAppName("Example App of Spark on Yarn");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname", "XXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address", "XXXXX:8032");

With this it is able to identify the ResourceManager, but the job fails because it does not identify the file system, even though I am setting the HDFS file system configuration as well:

sparkConfig.set("fs.defaultFS", "hdfs://xxxhacluster");
sparkConfig.set("ha.zookeeper.quorum", "xxx:2181,xxxx:2181,xxxx:2181");

Still, it assumes the local file system, and the error I am getting in the ResourceManager is:

exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
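One detail that may matter here: plain Hadoop keys such as fs.defaultFS set directly on a SparkConf are generally not copied into the Hadoop Configuration used by the YARN client; they usually need the spark.hadoop. prefix. A minimal sketch of the equivalent settings (host names are placeholders):

// Hadoop-side keys reach the Hadoop Configuration when prefixed with "spark.hadoop."
sparkConfig.set("spark.hadoop.fs.defaultFS", "hdfs://xxxhacluster");
sparkConfig.set("spark.hadoop.ha.zookeeper.quorum", "xxx:2181,xxxx:2181,xxxx:2181");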

Thanks and Regards,

Param.

Master Mentor

Have you tried the following?

import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI

// Build a Hadoop Configuration from the Spark configuration and get the file system it resolves to
val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
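In a Java driver, a rough equivalent (assuming an existing JavaSparkContext named jsc; the variable name is only illustrative) is to inspect the Hadoop Configuration that Spark will actually use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// jsc.hadoopConfiguration() is the Hadoop Configuration Spark hands to YARN/HDFS,
// so fs.defaultFS here shows which file system the staging uploads will target.
Configuration hadoopConf = jsc.hadoopConfiguration();
FileSystem fs = FileSystem.get(hadoopConf);  // throws IOException
System.out.println("fs.defaultFS = " + hadoopConf.get("fs.defaultFS"));
System.out.println("File system in use: " + fs.getUri());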

Rising Star

@Artem Ervits,

Thanks a lot for your time and help.

However, I was able to achieve my objective by setting the Hadoop and YARN properties in the Spark configuration:

sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXX"); sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXX:8032"); sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020"); sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/");

Regards,

Param.

Rising Star

Sorry for the delayed reply... I got busy with some work.

@Artem Ervits thanks a lot for all the responses.

I was able to achieve this by setting the Spark configuration as below:

sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXXX"); sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032"); sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXXX:8020,hdfs://XXXX:8020");

sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXXX:8020/user/hduser/");

sparkConfig.set("--deploy-mode", deployMode);

Thanks ,

Param.

New Contributor

I am not able to find the spark.hadoop.yarn.* properties; they are not listed in any Spark documents. Could you please point me to where I can find the list of spark.hadoop.yarn.* properties?