11-28-2017
10:45 AM
I want to run a Spark application on an AWS EMR instance and have it read from and write to a remote HA-enabled HDFS cluster. To do this I have been deploying Spark and Hadoop on my EMR instance and then running an AWS step that overwrites the 'default' Hadoop config files with the core-site.xml and hdfs-site.xml from my remote HDFS cluster. This lets me use the remote cluster's nameserviceID in my Spark logic when reading/writing to/from HDFS, e.g. df.write.parquet("hdfs://mycluster/path/to/file"), where mycluster is the nameserviceID of the remote HDFS cluster.
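For reference, the part of the remote cluster's hdfs-site.xml that this approach relies on is the HA client configuration for the nameservice. A minimal sketch (the hostnames and port are the same placeholders used in the code further down):

<!-- HA client settings for nameservice 'mycluster'; values are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>fqdn.of.nn1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>fqdn.of.nn2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>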
Is it possible to use the nameserviceID of my remote HDFS cluster without installing Hadoop on my EMR instance and having to overwrite the core-site.xml and hdfs-site.xml config files?
I have tried setting the following config in Spark:
val sc = new SparkContext
sc.hadoopConfiguration.set("dfs.nameservices", "mycluster")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.mycluster", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")
but this fails with the following exception when attempting to read/write to the remote HDFS cluster using the nameserviceID:
Exception in thread "main" java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:171)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:231)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:498)
... 20 more
Caused by: java.lang.RuntimeException: Could not find any configured addresses for URI hdfs://mycluster/path/to/file
at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:93)
... 25 more
The config I set on the Hadoop configuration object matches what is configured on my remote HDFS cluster. Any help appreciated.
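For reference, the same HA client settings can also be passed through Spark's spark.hadoop.* configuration prefix, which Spark copies into the Hadoop configuration it builds. A minimal sketch, assuming the same placeholder hostnames and that the properties are set before the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}

// Same HA client settings as above, expressed as spark.hadoop.* properties so that
// Spark copies them into the Hadoop Configuration used by the driver and executors.
// Hostnames and ports are placeholders.
val conf = new SparkConf()
  .setAppName("remote-hdfs-ha")
  .set("spark.hadoop.dfs.nameservices", "mycluster")
  .set("spark.hadoop.dfs.ha.namenodes.mycluster", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn1", "fqdn.of.nn1:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn2", "fqdn.of.nn2:8020")
  .set("spark.hadoop.dfs.client.failover.proxy.provider.mycluster",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val sc = new SparkContext(conf)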
Labels:
- Apache Hadoop
- Apache Spark