Created 11-28-2017 10:45 AM
I want to run a Spark application on an AWS EMR instance and have this application read from and write to a remote HA-enabled HDFS cluster. To do this I have been deploying Spark and Hadoop on my EMR instance and then running an EMR step which overwrites the 'default' Hadoop config files with my remote HDFS cluster's core-site.xml and hdfs-site.xml config files.
This means I can use my remote cluster's nameserviceID in my Spark logic when reading/writing to/from HDFS, e.g. df.write.parquet("hdfs://mycluster/path/to/file"), where mycluster is the nameserviceID of the remote HDFS cluster.
Is it possible to use the nameserviceID of my remote HDFS cluster without installing Hadoop on my EMR instance and overwriting the core-site.xml and hdfs-site.xml config files?
I have tried setting the following config in Spark:
val sc = new SparkContext
sc.hadoopConfiguration.set("dfs.nameservices", "mycluster")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.mycluster", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")
but this fails with the following exception when attempting to read/write to the remote HDFS cluster using the nameserviceID:
Exception in thread "main" java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:171)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:231)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:498)
... 20 more
Caused by: java.lang.RuntimeException: Could not find any configured addresses for URI hdfs://mycluster/path/to/file
at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:93)
... 25 more
The config I set on the Hadoop configuration object matches what is configured on my remote HDFS cluster.
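For comparison, a minimal sketch of passing the same HA properties through the SparkConf using the spark.hadoop. prefix (the nameservice ID, namenode IDs and host names are placeholders; Spark copies any property with this prefix into the Hadoop Configuration it builds, so the settings are in place before the first FileSystem for hdfs://mycluster is resolved):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values: "mycluster", "nn1"/"nn2" and the fqdn.of.* host names
// must match the remote cluster's actual HA configuration.
val conf = new SparkConf()
  .setAppName("remote-ha-hdfs")
  .set("spark.hadoop.dfs.nameservices", "mycluster")
  .set("spark.hadoop.dfs.ha.namenodes.mycluster", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")
  .set("spark.hadoop.dfs.client.failover.proxy.provider.mycluster",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val sc = new SparkContext(conf)

// The nameservice ID can then be used directly in paths, e.g.
// df.write.parquet("hdfs://mycluster/path/to/file")

The same spark.hadoop.* keys can equally be supplied as --conf options to spark-submit, which avoids overwriting core-site.xml/hdfs-site.xml on the EMR side, though whether this resolves the failure above isn't confirmed in this thread.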
Any help appreciated.
Created 03-19-2020 02:04 PM
Same problem here.
Created 03-19-2020 05:45 PM
Hi @zakariadem
While we welcome your question, given the lack of responses to the original question since it was posted in Nov 2017, we think you would be much more likely to obtain a suitable answer by posting it to the appropriate AWS forum for EMR.
Created 07-13-2021 02:15 AM
Same problem:
21/07/13 15:56:34 ERROR ExecutionImpl: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:530)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:172)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:662)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:606)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:513)
... 84 more
Caused by: java.lang.RuntimeException: Could not find any configured addresses for URI hdfs://nameservice1/user/hive/warehouse/test.db/test/year_month=20187
at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:93)
... 89 more