Explorer
Posts: 9
Registered: ‎11-12-2015

How to copy (distcp) between two HA HDFS clusters

Hi, 

 

The scenario is the following: I have cluster1 with HDFS HA enabled and I want to copy data to cluster2, which has HA enabled as well.

 

It seems I need to know the active NameNode to do that. 

 

The recommendation I've seen is to add the remote cluster's HA settings to hdfs-site.xml on the cluster:

https://issues.apache.org/jira/browse/HDFS-6376

http://www.syscrest.com/2016/02/access-remote-ha-enabled-hdfs-oozie-distcp-action/

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_administration/content/distcp_between_ha...

 

But that seems to pollute the cluster configuration, and it is potentially hard to maintain, as we would need to update those settings whenever the NameNode topology of either cluster changes.

 

Is there not some kind of autodiscovery mechanism? A lot of HA applications, for example, let you specify all nodes,

e.g. hdfs://node1,node2:/path/to/x

or hdfs-zookeeper://zookeeper-address:/path/to/x

 

 

 

 

New Contributor
Posts: 1
Registered: ‎10-12-2017

Re: How to copy (distcp) between two HA HDFS clusters

Explorer
Posts: 9
Registered: ‎11-12-2015

Re: How to copy (distcp) between two HA HDFS clusters

That's the sort of thing I want to avoid: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cdh_admin_distcp_data_cluster_migrat...

There are about 9 options that need to be copied and maintained on the remote node.

If I have an application that should be deployable on an arbitrary Hadoop cluster and work with another arbitrary Hadoop cluster, I'd rather not have to pass 9 options as arguments just to get HDFS access between the clusters working.

And if I want the application to work on an arbitrary non-Hadoop node, then I'd need to pass 9 * 2 = 18 properties.
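
To make the "options as arguments" variant concrete, here is a minimal sketch of what passing the remote nameservice per job (instead of editing hdfs-site.xml) might look like. The property names are the standard HDFS HA client settings; the helper functions, the nameservice names ns1/ns2 and the host names are hypothetical placeholders, not taken from any of the linked pages. It assumes the local nameservice is already described by the local hdfs-site.xml, and it only covers the RPC-address subset of the settings, so a real deployment may need more (for example on a Kerberized cluster).

```python
import shlex


def remote_ha_properties(local_ns, remote_ns, namenodes):
    """Build the client-side properties describing a remote HA nameservice.

    namenodes maps a NameNode ID (e.g. "nn1") to its "host:port" RPC address.
    All names passed in are placeholders chosen by the caller.
    """
    props = {
        # The client must know both logical nameservices for this job.
        "dfs.nameservices": f"{local_ns},{remote_ns}",
        "dfs.ha.namenodes." + remote_ns: ",".join(namenodes),
        "dfs.client.failover.proxy.provider." + remote_ns:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{remote_ns}.{nn_id}"] = address
    return props


def distcp_command(props, src, dst):
    """Return the argument list for a distcp run with per-job -D overrides."""
    args = ["hadoop", "distcp"]
    for key, value in props.items():
        args.append(f"-D{key}={value}")
    return args + [src, dst]


if __name__ == "__main__":
    # Hypothetical hosts; replace with the remote cluster's NameNodes.
    props = remote_ha_properties(
        "ns1", "ns2",
        {"nn1": "remote-nn1.example.com:8020",
         "nn2": "remote-nn2.example.com:8020"},
    )
    print(shlex.join(distcp_command(props, "hdfs://ns1/path/to/x",
                                    "hdfs://ns2/path/to/x")))
```

This keeps the cluster-wide hdfs-site.xml untouched, but it does not solve the maintenance problem: the caller still has to know the remote NameNode addresses and update them when the topology changes.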

 

 

 

 
