
Setting the Spark configuration in yarn mode for HDFS file system.


Contributor

Hi All,

When I run a Spark application in YARN mode against the HDFS file system, it works fine when I provide the properties below.

sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address", resourcemanagerAddress);
sparkConf.set("spark.yarn.stagingDir", stagingDirectory);

But the problems here are:

1.

Since my HDFS has NameNode HA enabled, it does not work when I set spark.yarn.stagingDir to the common nameservice URL of HDFS, e.g. hdfs://hdcluster/user/tmp/; it gives an "unknown host hdcluster" error. It works fine when I give the URL as hdfs://<ActiveNameNode>/user/tmp/, but we don't know in advance which NameNode will be active, so how can this be resolved?

I have also noticed that SparkContext accepts a Hadoop configuration, but the SparkConf class has no method that accepts one.

2.

How do I provide the ResourceManager address when the ResourceManagers are running in HA?

Thanks in Advance ,

Param.

3 REPLIES

Re: Setting the Spark configuration in yarn mode for HDFS file system.

Guru

@Param NC,

1. Can you try setting spark.yarn.stagingDir to hdfs:///user/tmp/ ?

2. Can you please share which Spark config you are trying to set that requires the RM address?

Re: Setting the Spark configuration in yarn mode for HDFS file system.

Contributor

@yvora Thanks for the response.

1. Can you try setting spark.yarn.stagingDir to hdfs:///user/tmp/ ?

This is not working.
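One workaround that is often suggested for this situation (an assumption on my part, not something confirmed in this thread) is to pass the NameNode HA client settings through SparkConf itself: any property set with the spark.hadoop. prefix is copied into the Hadoop configuration that the SparkContext builds, so the hdcluster nameservice can resolve without knowing which NameNode is active. The NameNode hostnames below are hypothetical placeholders. A minimal sketch of the keys involved:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HdfsHaSparkProps {
    // Build spark.hadoop.* properties mirroring an hdfs-site.xml HA client
    // configuration. Hostnames and ports here are hypothetical examples.
    static Map<String, String> hdfsHaProps(String nameservice,
                                           String nn1Host, String nn2Host) {
        Map<String, String> p = new LinkedHashMap<>();
        String prefix = "spark.hadoop.";
        p.put(prefix + "dfs.nameservices", nameservice);
        p.put(prefix + "dfs.ha.namenodes." + nameservice, "nn1,nn2");
        p.put(prefix + "dfs.namenode.rpc-address." + nameservice + ".nn1",
              nn1Host + ":8020");
        p.put(prefix + "dfs.namenode.rpc-address." + nameservice + ".nn2",
              nn2Host + ":8020");
        p.put(prefix + "dfs.client.failover.proxy.provider." + nameservice,
              "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return p;
    }

    public static void main(String[] args) {
        // Each entry would be applied as sparkConf.set(key, value);
        // then spark.yarn.stagingDir can use hdfs://hdcluster/user/tmp/.
        hdfsHaProps("hdcluster", "namenode1.example.com", "namenode2.example.com")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Whether this works will depend on the Spark version honoring the spark.hadoop. prefix before staging-directory resolution, so treat it as a sketch to verify against your cluster.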

2. Can you please share which Spark config you are trying to set that requires the RM address?

I am launching the Spark application from a Java program, so when the master is yarn it connects by default to the ResourceManager at 0.0.0.0:8032. To override this, I need to set the corresponding properties in the Spark configuration, i.e.

sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", resourcemanagerHostname);
sparkConf.set("spark.hadoop.yarn.resourcemanager.address", resourcemanagerAddress);

But the problem is: when ResourceManager HA is enabled, how do I connect to it?
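For the ResourceManager HA part, one approach (hedged; the exact property set depends on the cluster's yarn-site.xml) is to stop pointing at a single yarn.resourcemanager.address and instead pass the RM HA properties through spark.hadoop.* keys, so the YARN client can fail over between the rm1 and rm2 ids itself. The hostnames and cluster id below are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class YarnRmHaSparkProps {
    // Build spark.hadoop.* properties mirroring a yarn-site.xml
    // ResourceManager HA setup. Hostnames and cluster id are placeholders.
    static Map<String, String> rmHaProps(String rm1Host, String rm2Host) {
        Map<String, String> p = new LinkedHashMap<>();
        String prefix = "spark.hadoop.";
        p.put(prefix + "yarn.resourcemanager.ha.enabled", "true");
        p.put(prefix + "yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
        p.put(prefix + "yarn.resourcemanager.hostname.rm1", rm1Host);
        p.put(prefix + "yarn.resourcemanager.hostname.rm2", rm2Host);
        p.put(prefix + "yarn.resourcemanager.cluster-id", "yarn-cluster");
        return p;
    }

    public static void main(String[] args) {
        // Each entry would be applied as sparkConf.set(key, value)
        // instead of a single yarn.resourcemanager.address.
        rmHaProps("rm1.example.com", "rm2.example.com")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With these set, the YARN client library is what chooses the active RM, so no single address needs to be known up front.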

One idea I have found regarding my question: there is a way to achieve this on the SparkContext, as below.

JavaSparkContext jsc = new JavaSparkContext(sparkConf);

jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "core-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "hdfs-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "mapred-site.xml"));
jsc.hadoopConfiguration().addResource(new Path(hadoopClusterSiteFilesBasePath + "yarn-site.xml"));

But the ResourceManager and staging-directory configuration are needed even before the context is created, so this does not solve the problem.

What I am looking for is something like the above for the SparkConf class/object.
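Since SparkConf has no addResource(), a hedged sketch of one way to approximate it: parse each site file before creating the context (e.g. with Hadoop's Configuration class) and re-set every key/value on SparkConf with the spark.hadoop. prefix. The prefix-copy step is shown here with a plain map standing in for the parsed XML, so the Spark/Hadoop classes are not required to run it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixHadoopConf {
    // Copy Hadoop key/value pairs into SparkConf-style keys by adding
    // the spark.hadoop. prefix; in a real program each resulting entry
    // would be applied via sparkConf.set(key, value) before the
    // JavaSparkContext is constructed.
    static Map<String, String> toSparkConfKeys(Map<String, String> hadoopConf) {
        Map<String, String> out = new LinkedHashMap<>();
        hadoopConf.forEach((k, v) -> out.put("spark.hadoop." + k, v));
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for values read from the cluster's site XML files.
        Map<String, String> hadoop = new LinkedHashMap<>();
        hadoop.put("dfs.nameservices", "hdcluster");
        hadoop.put("yarn.resourcemanager.ha.enabled", "true");
        toSparkConfKeys(hadoop).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

This keeps all of the HA settings available before the context exists, which is exactly the window where spark.yarn.stagingDir and the RM address are resolved.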

Thanks ,

Param.

Re: Setting the Spark configuration in yarn mode for HDFS file system.

New Contributor

@Param NC,

I am facing the same issue when trying to start a SparkSession on YARN. Did you solve this?
