Created 02-26-2017 12:01 PM
Hi All,
I am new to Spark. I am trying to submit a Spark application from a Java program, and I am able to submit one to a Spark standalone cluster. What I actually want to achieve is submitting the job to a YARN cluster. I am able to connect to the YARN cluster by explicitly setting the Resource Manager property in the Spark config, as below.
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXX:8032");
But the application is failing with:
exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
I got this from the Resource Manager log. What I found is that it is assuming the file system is local and is not uploading the required libraries.
Source and destination file systems are the same. Not copying file:/tmp/spark-1ed67f05-d496-4000-86c1-07fcf8526181/__spark_libs__1740543841989079602.zip
I got this from the log of the Spark application where I am running my program.
The issue I suspect is that it is treating the file system as local rather than HDFS. Correct me if I am wrong.
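The suspicion can be checked directly from the path in the log: its URI scheme is file:, so the archive lives on the submitting machine's local disk, which a NodeManager on another host cannot read. A minimal sketch using only the JDK; the class and method names are made up for illustration, and treating a missing scheme as local is an assumption of this sketch.

```java
import java.net.URI;

public class StagingPathCheck {
    /** Returns true when the location refers to a local disk rather than a shared file system. */
    public static boolean isLocalFileSystem(String location) {
        String scheme = URI.create(location).getScheme();
        // Assumption of this sketch: a path without any scheme is treated as local.
        return scheme == null || scheme.equals("file");
    }

    public static void main(String[] args) {
        // Path copied from the Resource Manager log above.
        String fromLog = "file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip";
        // Placeholder HDFS location; host and port are not real values.
        String onHdfs = "hdfs://XXXX:8020/user/hduser/";

        System.out.println(isLocalFileSystem(fromLog)); // true
        System.out.println(isLocalFileSystem(onHdfs));  // false
    }
}
```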
My questions are:
1. What is the actual issue causing the job to fail, given the log info above?
2. Could you please tell me how to add resource files to the Spark configuration, like addResource in the Hadoop configuration?
Thanks in advance,
Param.
Created 02-28-2017 04:58 PM
@Artem Ervits,
Thanks a lot for your time and the help given.
However, I was able to achieve my objective by setting the Hadoop and YARN properties in the Spark configuration.
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXX:8032");
sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020");
sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/");
Regards,
Param.
Created 02-26-2017 03:12 PM
@Param NC please take a look at our documentation http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/ch_developi...
For general knowledge, here's an example of doing it in YARN mode, from http://spark.apache.org/docs/1.6.2/submitting-applications.html. Usually HADOOP_CONF_DIR points to /etc/hadoop/conf on an HDP distribution. That directory contains core-site.xml, yarn-site.xml, hdfs-site.xml, etc.
export HADOOP_CONF_DIR=XXX
# --deploy-mode can be client for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
Created 02-26-2017 03:37 PM
@Param NC here's how I got it to work on my cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --num-executors 3 \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar \
  100
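For reference, a command like the one above can also be started from a Java process. The following is only a sketch using the JDK's ProcessBuilder; the class name, method names and the jar path are placeholders, not anything from this cluster. Note that ProcessBuilder does not expand shell wildcards, so the jar has to be given as a concrete path. Spark also ships a launcher API (org.apache.spark.launcher.SparkLauncher) that wraps the same script, if adding that dependency is an option.

```java
import java.util.ArrayList;
import java.util.List;

public class SparkSubmitCommand {
    /** Builds the spark-submit argument list for a YARN submission. */
    public static List<String> build(String sparkSubmit, String mainClass, String deployMode,
                                     String appJar, String... appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkSubmit);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add("yarn");
        cmd.add("--deploy-mode");
        cmd.add(deployMode);
        cmd.add(appJar);
        for (String arg : appArgs) {
            cmd.add(arg);
        }
        return cmd;
    }

    /** Wraps the command in a ProcessBuilder with HADOOP_CONF_DIR set, without starting it. */
    public static ProcessBuilder prepare(List<String> cmd, String hadoopConfDir) {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("HADOOP_CONF_DIR", hadoopConfDir);
        pb.inheritIO(); // show spark-submit output on this process's console
        return pb;
    }

    public static void main(String[] args) {
        // Paths are placeholders; adjust them to the local installation.
        List<String> cmd = build("/usr/hdp/current/spark-client/bin/spark-submit",
                "org.apache.spark.examples.SparkPi", "cluster", "/path/to/examples.jar", "100");
        System.out.println(String.join(" ", cmd));
        // To actually submit: prepare(cmd, "/etc/hadoop/conf").start().waitFor();
    }
}
```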
Created 02-27-2017 06:44 AM
@Artem Ervits
Thank you very much for the response.
I am able to submit the job to YARN through the spark-submit command, but what I am actually looking for is doing the same thing through the program. It would be great if you could give a template for that, preferably in Java.
-Param.
Created 02-27-2017 03:01 PM
@Param NC you need to build your application with the hadoop-client dependency in your pom.xml or sbt build; for the scope, use <scope>provided</scope>. From http://spark.apache.org/docs/1.6.2/submitting-applications.html:
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
More info here http://spark.apache.org/docs/1.6.2/running-on-yarn.html
Here's a sample pom.xml definition for hadoop-client
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1.2.3.0.0-2557</version>
            <scope>provided</scope>
            <type>jar</type>
        </dependency>
    </dependencies>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
    
    <repositories>
        <repository>
            <id>HDPReleases</id>
            <name>HDP Releases</name>
            <url>http://repo.hortonworks.com/content/repositories/public</url>
            <layout>default</layout>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
                <checksumPolicy>warn</checksumPolicy>
            </releases>
            <snapshots>
                <enabled>false</enabled>
                <updatePolicy>never</updatePolicy>
                <checksumPolicy>fail</checksumPolicy>
            </snapshots>
        </repository>
        <repository>
            <id>HDPJetty</id>
            <name>Hadoop Jetty</name>
            <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
            <layout>default</layout>
            <releases>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
                <checksumPolicy>warn</checksumPolicy>
            </releases>
            <snapshots>
                <enabled>false</enabled>
                <updatePolicy>never</updatePolicy>
                <checksumPolicy>fail</checksumPolicy>
            </snapshots>
        </repository>
        <repository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>central</id>
            <name>bintray</name>
            <url>http://jcenter.bintray.com</url>
        </repository>
    </repositories>
Created 02-27-2017 05:54 PM
@Artem Ervits,
Thanks again! And sorry if I am asking too many questions here.
What I am actually looking for is this: per the project requirement, I should not use the spark-submit script, so I am passing the cluster configuration through the Spark config, as given below.
SparkConf sparkConfig = new SparkConf().setAppName("Example App of Spark on Yarn");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032");
It is able to identify the Resource Manager, but it is failing because it is not identifying the file system, even though I am setting the HDFS file system configuration as well.
sparkConfig.set("fs.defaultFS", "hdfs://xxxhacluster");
sparkConfig.set("ha.zookeeper.quorum", "xxx:2181,xxxx:2181,xxxx:2181");
It still assumes the local file system, and the error I am getting in the Resource Manager is:
exited with exitCode: -1000 due to: File file:/tmp/spark-0e6626c2-d344-4cae-897f-934e3eb01d8f/__spark_libs__1448521825653017037.zip does not exist
Thanks and Regards,
Param.
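One likely reason the fs.defaultFS setting had no effect: as far as I understand, Spark only copies SparkConf entries whose key starts with spark.hadoop. into the Hadoop Configuration, with the prefix removed, so a bare Hadoop key set on SparkConf is not forwarded. The yarn.resourcemanager.* names themselves are ordinary Hadoop/YARN properties, not Spark ones. A minimal JDK-only sketch of the key mapping follows; the class and method names are made up for illustration, and whether the prefixed keys alone are enough for an HA nameservice is not something this sketch verifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HadoopKeyPrefix {
    static final String PREFIX = "spark.hadoop.";

    /** Returns a copy of the map in which every Hadoop key carries the spark.hadoop. prefix. */
    public static Map<String, String> toSparkKeys(Map<String, String> hadoopProps) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : hadoopProps.entrySet()) {
            String key = e.getKey();
            // Leave keys alone that are already prefixed.
            out.put(key.startsWith(PREFIX) ? key : PREFIX + key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> hadoopProps = new LinkedHashMap<>();
        hadoopProps.put("fs.defaultFS", "hdfs://xxxhacluster");
        hadoopProps.put("yarn.resourcemanager.address", "XXXXX:8032");

        for (Map.Entry<String, String> e : toSparkKeys(hadoopProps).entrySet()) {
            // With Spark on the classpath, each pair would be passed to sparkConfig.set(key, value).
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```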
Created 02-28-2017 12:32 PM
have you tried the following?
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI

val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
Created 03-15-2017 11:25 AM
Sorry for the delayed reply; I got busy with some work.
@Artem Ervits thanks a lot for all the responses.
I was able to achieve this by setting the Spark configuration as below:
sparkConfig.set("spark.hadoop.yarn.resourcemanager.hostname","XXXXX");
sparkConfig.set("spark.hadoop.yarn.resourcemanager.address","XXXXX:8032");
sparkConfig.set("spark.yarn.access.namenodes", "hdfs://XXXXX:8020,hdfs://XXXX:8020");
sparkConfig.set("spark.yarn.stagingDir", "hdfs://XXXXX:8020/user/hduser/");
sparkConfig.set("spark.submit.deployMode", deployMode);
Thanks,
Param.
Created 01-17-2018 10:43 PM
I am not able to find the spark.hadoop.yarn.* properties. These properties are not listed in any Spark documentation. Please help: where can I find a list of the spark.hadoop.yarn properties?