
Setting the Spark configuration in YARN mode for the HDFS file system

Rising Star

Hi All,

When I try to run the Spark application in YARN mode using the HDFS file system, it works fine when I provide the properties below.

sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname",resourcemanagerHostname); sparkConf.set("spark.hadoop.yarn.resourcemanager.address",resourcemanagerAddress); sparkConf.set("spark.yarn.stagingDir",stagingDirectory );

But there are two problems here:

1. Since my HDFS is NameNode HA enabled, it does not work when I set spark.yarn.stagingDir to the nameservice URL of HDFS, e.g. hdfs://hdcluster/user/tmp/ ; it fails with an "unknown host hdcluster" error. It works fine when I give the URL as hdfs://<ActiveNameNode>/user/tmp/ , but we don't know in advance which NameNode will be active, so how do I resolve this? (See the sketch after this list.) One thing I have noticed is that SparkContext accepts a Hadoop configuration, but the SparkConf class has no method that accepts one.

2. How do I provide the ResourceManager address when the ResourceManagers are running in HA?
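For reference, since the spark.hadoop. prefix already works for the ResourceManager properties above, what I have been experimenting with for point 1 is passing the standard HDFS HA client settings the same way, roughly as below. The nn1/nn2 ids, host names, and port are placeholders for my cluster, and I have not confirmed this is the right approach.

// Mirror the NameNode HA client settings from hdfs-site.xml.
sparkConf.set("spark.hadoop.fs.defaultFS", "hdfs://hdcluster");
sparkConf.set("spark.hadoop.dfs.nameservices", "hdcluster");
sparkConf.set("spark.hadoop.dfs.ha.namenodes.hdcluster", "nn1,nn2");
sparkConf.set("spark.hadoop.dfs.namenode.rpc-address.hdcluster.nn1", "namenode1.example.com:8020");
sparkConf.set("spark.hadoop.dfs.namenode.rpc-address.hdcluster.nn2", "namenode2.example.com:8020");
// The failover proxy provider is what lets the client resolve the logical
// name "hdcluster" to whichever NameNode is currently active.
sparkConf.set("spark.hadoop.dfs.client.failover.proxy.provider.hdcluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");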

Thanks in advance,

Param.

3 REPLIES

Guru

@Param NC,

1. Can you try setting spark.yarn.stagingDir to hdfs:///user/tmp/ ?
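i.e., in your SparkConf, something like:

sparkConf.set("spark.yarn.stagingDir", "hdfs:///user/tmp/");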

2. Can you please share which Spark config you are trying to set that requires the RM address?

Rising Star

@yvora Thanks for the response.

1. Can you try setting spark.yarn.stagingDir to hdfs:///user/tmp/ ?

This is not working.

2. Can you please share which Spark config you are trying to set that requires the RM address?

I am trying to run the Spark application through a Java program. When the master is yarn, Spark connects by default to the ResourceManager at 0.0.0.0:8032; to override this, I need to set the address in the Spark configuration, i.e.:

sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname",resourcemanagerHostname); sparkConf.set("spark.hadoop.yarn.resourcemanager.address",resourcemanagerAddress);

But the problem is that my ResourceManager is HA enabled, so how do I connect to it?
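My guess is that the standard YARN HA client properties could be passed the same way with the spark.hadoop. prefix, something like the sketch below (the rm1/rm2 ids and host names are placeholders for my cluster), but I have not verified this:

sparkConf.set("spark.hadoop.yarn.resourcemanager.ha.enabled", "true");
sparkConf.set("spark.hadoop.yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
// With HA enabled, the YARN client should fail over between rm1 and rm2
// on its own instead of needing a single fixed RM address.
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname.rm1", "resourcemanager1.example.com");
sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname.rm2", "resourcemanager2.example.com");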

One idea I have found about my question is that there is a way to achieve this on the SparkContext, as below.

JavaSparkContext jsc = new JavaSparkContext(sparkConf);

jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "core-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "hdfs-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "mapred-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "yarn-site.xml"));

But the ResourceManager and staging directory configuration is needed even before the context is created, so this does not solve the problem.

What I am looking for is something like the above for the SparkConf class/object.
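One workaround I am considering (not yet verified) is to load the site files into a plain Hadoop Configuration before any context exists, and then copy every entry into the SparkConf with the spark.hadoop. prefix, which Spark propagates into the Hadoop configuration it builds at startup. hadoopClusterSiteFilesBasePath is the same variable as above:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;

// Load the cluster site files into a standalone Hadoop Configuration
// (false = skip the built-in defaults, keep only these resources).
Configuration hadoopConf = new Configuration(false);
hadoopConf.addResource(new Path(hadoopClusterSiteFilesBasePath + "core-site.xml"));
hadoopConf.addResource(new Path(hadoopClusterSiteFilesBasePath + "hdfs-site.xml"));
hadoopConf.addResource(new Path(hadoopClusterSiteFilesBasePath + "yarn-site.xml"));

// Copy every property into the SparkConf with the spark.hadoop. prefix
// so it reaches Spark's Hadoop configuration before the context is created.
SparkConf sparkConf = new SparkConf().setMaster("yarn");
for (Map.Entry<String, String> entry : hadoopConf) {
    sparkConf.set("spark.hadoop." + entry.getKey(), entry.getValue());
}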

Thanks,

Param.

New Contributor

@Param NC,

I am facing the same issue when trying to start a SparkSession on YARN. Did you solve this?