Spark: Read from remote HA-enabled HDFS using nameservice ID

Explorer

I want to run a Spark application on an AWS EMR instance and have it read from and write to a remote HA-enabled HDFS cluster. To do this I have been deploying Spark and Hadoop on my EMR instance and then running an AWS step that overwrites the 'default' Hadoop config files with the core-site.xml and hdfs-site.xml files from my remote HDFS cluster.

This means I can use my remote cluster's nameservice ID in my Spark logic when reading from/writing to HDFS, e.g. df.write.parquet("hdfs://mycluster/path/to/file"), where mycluster is the nameservice ID of the remote HDFS cluster.
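
Reads go through the same nameservice-based URI, with the HDFS client resolving mycluster to the active NameNode using the copied hdfs-site.xml. For example (paths are placeholders):

// Assuming a SQLContext named sqlContext is already in scope; paths are placeholders
val df = sqlContext.read.parquet("hdfs://mycluster/path/to/input")
df.write.parquet("hdfs://mycluster/path/to/output")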

Is it possible to use the nameservice ID of my remote HDFS cluster without installing Hadoop on my EMR instance and overwriting the core-site.xml and hdfs-site.xml config files?

I have tried setting the following config in Spark:

import org.apache.spark.SparkContext

// Create the context, then point its Hadoop configuration at the remote HA nameservice
val sc = new SparkContext()
sc.hadoopConfiguration.set("dfs.nameservices", "mycluster")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.mycluster", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")

but this fails with the following exception when attempting to read from/write to the remote HDFS cluster using the nameservice ID:

Exception in thread "main" java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
	at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:171)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:231)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:498)
	... 20 more
Caused by: java.lang.RuntimeException: Could not find any configured addresses for URI hdfs://mycluster/path/to/file
	at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:93)
	... 25 more

The config I set on the Hadoop configuration object matches what is configured on my remote HDFS cluster.
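
One thing I have been wondering about is whether supplying the same values on the SparkConf with the spark.hadoop. prefix, before the context is created, would behave any differently, since those properties get copied into the Hadoop configuration up front. An untested sketch of what I mean (hostnames and ports are placeholders for my real values):

import org.apache.spark.{SparkConf, SparkContext}

// Untested sketch: spark.hadoop.* properties are copied into the Hadoop
// configuration when the context is created; hostnames/ports are placeholders.
val conf = new SparkConf()
  .set("spark.hadoop.dfs.nameservices", "mycluster")
  .set("spark.hadoop.dfs.ha.namenodes.mycluster", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")
  .set("spark.hadoop.dfs.client.failover.proxy.provider.mycluster",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
val sc = new SparkContext(conf)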

Any help appreciated.

3 REPLIES

Explorer

Same problem here.


Hi @zakariadem,

While we welcome your question, based on the lack of responses to the original question in this thread since it was posted in November 2017, we think you would be much more likely to obtain a suitable answer if you posted it to the appropriate AWS forum for EMR.

Bill Brooks, Community Moderator

Explorer

Same problem:

21/07/13 15:56:34 ERROR ExecutionImpl: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
	at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:530)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:172)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:662)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:606)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:513)
	... 84 more
Caused by: java.lang.RuntimeException: Could not find any configured addresses for URI hdfs://nameservice1/user/hive/warehouse/test.db/test/year_month=20187
	at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:93)
	... 89 more