How to add the Hadoop and YARN configuration files to the Spark application classpath?
Labels: Apache Hadoop, Apache Spark, Apache YARN
Created 02-26-2017 12:01 PM
Hi All,
I am new to Spark. I am trying to submit a Spark application from a Java program, and I am able to do this for a Spark standalone cluster. What I actually want to achieve is submitting the job to a YARN cluster, and I am able to connect to the YARN cluster by explicitly adding the Resource Manager property to the Spark config as below.
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXX:8032");
However, the application is failing with:
exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
I got this from the Resource Manager log; what I found is that it assumes the file system is local and does not upload the required libraries.
Source and destination file systems are the same. Not copying file:/tmp/spark-1ed67f05-d496-4000-86c1-07fcf8526181/__spark_libs__1740543841989079602.zip
I got this from the Spark application where I am running my program.
What I suspect is that it is treating the file system as local rather than HDFS. Correct me if I am wrong.
My questions are:
1. What is the actual reason for the job to fail, given the data and log info above?
2. Could you please tell me how to add resource files to the Spark configuration, like addResource in the Hadoop Configuration? (See the sketch below for what I mean.)
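For reference, the Hadoop-side equivalent I have in mind looks roughly like this (a rough sketch only; the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hadoop side: config resource files are added to the Configuration explicitly.
Configuration hadoopConf = new Configuration();
hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));   // placeholder path
hadoopConf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));   // placeholder path
// I am looking for the SparkConf equivalent of the above.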
Thanks in advance,
Param.
Created 02-26-2017 03:12 PM
@Param NC please take a look at our documentation http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/ch_developi...
For general knowledge, here's an example of doing it in YARN mode, from http://spark.apache.org/docs/1.6.2/submitting-applications.html. On an HDP distribution, HADOOP_CONF_DIR usually points to /etc/hadoop/conf; that directory contains core-site.xml, yarn-site.xml, hdfs-site.xml, etc.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
(--deploy-mode can be client for client mode.)
Created 02-26-2017 03:37 PM
@Param NC here's how I got it to work on my cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --num-executors 3 \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar \
  100
Created 02-27-2017 06:44 AM
@Artem Ervits
Thank you very much for the response.
I am able to submit the job to YARN through the spark-submit command, but what I am actually looking for is how to do the same thing through a program. It would be great if you could provide a template for that, Java preferably.
-Param.
Created 02-27-2017 03:01 PM
@Param NC you need to build your application with the hadoop-client dependency in your pom.xml or sbt build; for scope, supply <scope>provided</scope>. http://spark.apache.org/docs/1.6.2/submitting-applications.html
Bundling Your Application’s Dependencies
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script as shown here while passing your jar.
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a .zip or .egg.
More info here http://spark.apache.org/docs/1.6.2/running-on-yarn.html
Here's a sample pom.xml definition for hadoop-client
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1.2.3.0.0-2557</version>
    <scope>provided</scope>
    <type>jar</type>
  </dependency>
</dependencies>
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <maven.compiler.source>1.7</maven.compiler.source>
  <maven.compiler.target>1.7</maven.compiler.target>
</properties>
<repositories>
  <repository>
    <id>HDPReleases</id>
    <name>HDP Releases</name>
    <url>http://repo.hortonworks.com/content/repositories/public</url>
    <layout>default</layout>
    <releases>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
      <checksumPolicy>warn</checksumPolicy>
    </releases>
    <snapshots>
      <enabled>false</enabled>
      <updatePolicy>never</updatePolicy>
      <checksumPolicy>fail</checksumPolicy>
    </snapshots>
  </repository>
  <repository>
    <id>HDPJetty</id>
    <name>Hadoop Jetty</name>
    <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
    <layout>default</layout>
    <releases>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
      <checksumPolicy>warn</checksumPolicy>
    </releases>
    <snapshots>
      <enabled>false</enabled>
      <updatePolicy>never</updatePolicy>
      <checksumPolicy>fail</checksumPolicy>
    </snapshots>
  </repository>
  <repository>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
    <id>central</id>
    <name>bintray</name>
    <url>http://jcenter.bintray.com</url>
  </repository>
</repositories>
Created 02-27-2017 05:54 PM
@Artem Ervits,
Thanks again! And sorry if I am asking too many questions here.
What I am actually looking for: as per the project requirement, I should not use the spark-submit script, so I am passing the cluster configuration through the Spark config as given below.
SparkConf sparkConfig = new SparkConf().setAppName("Example App of Spark on Yarn");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname", "XXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address", "XXXXX:8032");
With this it is able to identify the Resource Manager, but the job is failing because it does not identify the file system, even though I am setting the HDFS file system configuration as well:
sparkConfig.set("fs.defaultFS", "hdfs://xxxhacluster");
sparkConfig.set("ha.zookeeper.quorum", "xxx:2181,xxxx:2181,xxxx:2181");
It still assumes the local file system, and the error I am getting in the Resource Manager is:
exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
Thanks and Regards,
Param.
Created 02-28-2017 12:32 PM
Have you tried the following?
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI

val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
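If you are in Java rather than Scala, the equivalent is roughly this (a sketch; it assumes an existing JavaSparkContext named sc):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: sc.hadoopConfiguration() returns the Hadoop Configuration Spark has built,
// including anything supplied via spark.hadoop.* keys.
Configuration hadoopConf = sc.hadoopConfiguration();
FileSystem fs = FileSystem.get(hadoopConf);   // FileSystem.get throws IOException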
Created 02-28-2017 04:58 PM
@Artem Ervits ,
Thanks a lot for your time and the help given.
I was able to achieve my objective by setting the Hadoop and YARN properties in the Spark configuration:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXX"); sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXX:8032"); sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020"); sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/");
Regards,
Param.
Created 03-15-2017 11:25 AM
Sorry for the delayed reply; I got busy with some work.
@Artem Ervits, thanks a lot for all the responses.
I was able to achieve this by setting the Spark configuration as below:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXXX"); sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032"); sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXXX:8020,hdfs://XXXX:8020");
sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXXX:8020/user/hduser/");
sparkConfig.set("--deploy-mode", deployMode);
Thanks,
Param.
Created 01-17-2018 10:43 PM
I am not able to find the spark.hadoop.yarn.* properties; they are not listed in any Spark documents. Please help me: where can I find the list of spark.hadoop.yarn.* properties?
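As far as I know, spark.hadoop.yarn.* keys are not individual Spark properties, which is why they are not listed in the Spark documentation: Spark strips the spark.hadoop. prefix and forwards the remainder into the Hadoop Configuration, so the valid keys are the ordinary Hadoop/YARN ones (yarn-default.xml, yarn-site.xml, etc.). A small illustration, with a placeholder host:
import org.apache.spark.SparkConf;

// Any key of the form spark.hadoop.<hadoop-key> is copied into the Hadoop
// Configuration as <hadoop-key> when Spark builds it. The host below is a placeholder.
SparkConf conf = new SparkConf();
conf.set("spark.hadoop.yarn.resourcemanager.address", "rm-host:8032");
// ends up in the Hadoop Configuration as:
//   yarn.resourcemanager.address = rm-host:8032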
