Access HDFS/MR from remote HDFS cluster

avatar

Hi, I have a requirement that sounds similar to symlinking in Hadoop/HDFS.

Requirement:
There are two production clusters: Cluster 1 and Cluster 2.

I want to read cluster 1's data from cluster 2 without copying it.

What came to my mind is: can I use hadoop fs -ls hdfs://namespace1/user/xyz on cluster 2?

I understand that cluster 2 won't know what namespace1 is, but I thought of appending the namespace-related info to hdfs-site.xml on cluster 2 (via the advanced configuration snippet - gateway configs).

Is this possible?
Is there any other alternative, such as hftp? (I have never tried either.)

Thanks

Siddesh
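
For reference, the "namespace ID related info" mentioned above boils down to the standard HDFS client properties for a second nameservice. A minimal sketch, assuming cluster 1 runs HA namenodes nn1/nn2 (the hostnames, port, and nameservice names are placeholders):

# Sketch only: these properties would go inside <configuration> of the
# client-side hdfs-site.xml on cluster 2 (or its gateway advanced snippet),
# alongside cluster 2's own nameservice definition.
#   dfs.nameservices                              = <cluster2-ns>,namespace1
#   dfs.ha.namenodes.namespace1                   = nn1,nn2
#   dfs.namenode.rpc-address.namespace1.nn1       = nn1.cluster1.example.com:8020
#   dfs.namenode.rpc-address.namespace1.nn2       = nn2.cluster1.example.com:8020
#   dfs.client.failover.proxy.provider.namespace1 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
# With those visible to the client, the listing in the question would be attempted as:
hadoop fs -ls hdfs://namespace1/user/xyz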

1 ACCEPTED SOLUTION

avatar
Contributor
It is worth checking whether the use case is actually suited to HDFS's NFS Gateway role [1], which is designed for this kind of remote cluster access.

[1] Adding and Configuring an NFS Gateway - https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_hdfs_nfsgateway.html
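
As a rough sketch of what that looks like once an NFS Gateway role has been added on a cluster 1 host (nfsgw.cluster1.example.com and the mount point are placeholders; the mount options are the usual NFSv3 ones, so check the linked documentation for the exact recommended set):

# Mount cluster 1's HDFS, exported by its NFS Gateway, on a cluster 2 host:
sudo mkdir -p /mnt/cluster1_hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock nfsgw.cluster1.example.com:/ /mnt/cluster1_hdfs
ls /mnt/cluster1_hdfs/user/xyz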


6 REPLIES

avatar
No, that is not possible at all.
If we are talking about listing directories, you have to understand what happens in the background: the namenode reads the list of contents (from memory). The "other" HDFS has its own namenode, so your namenode would have to contact the other one.
Data retrieval: your HDFS client connects to the namenode and asks for the block location(s) of the file you are trying to access. Then it connects to the datanodes with the block requests. There is just one namenode in this scenario.

What is possible is to create an HDFS federation: there you can split the whole file directory tree and assign parts of it to multiple namenodes. But that is still a single-cluster scenario.
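
To see that two-step read path concretely, fsck shows exactly what the namenode hands back for a file: the block IDs and the datanodes that hold them (the path below is just an example):

# Ask the namenode for a file's blocks and their datanode locations; the client
# would then read the blocks from those datanodes directly.
hdfs fsck /user/xyz/somefile -files -blocks -locations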

avatar

I mentioned that, since I am reading data from Cluster 1, I am using hdfs://nameservice1:/user/abc on cluster 2.

nameservice1 refers to the namenodes of cluster 1, so what is the issue?

Thanks

Siddesh

avatar

I was replying to the idea of a symlink.

If you just want to access data from Cluster1 on Cluster2 (or anywhere else), make sure the HDFS config files your client uses point to Cluster1. I think it is hdfs-site.xml.
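
One way to do that without disturbing cluster 2's defaults is to keep Cluster1's client configs in a separate directory and point the command at it. A sketch, where /etc/cluster1-conf is a hypothetical directory holding the core-site.xml and hdfs-site.xml downloaded from Cluster1:

# Point a single command at Cluster1's client configuration:
hadoop --config /etc/cluster1-conf fs -ls /user/xyz

# Or switch the whole shell session over via the environment:
export HADOOP_CONF_DIR=/etc/cluster1-conf
hdfs dfs -ls /user/xyz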

avatar
Explorer

I suggest creating two Linux user accounts, one for cluster1 and one for cluster2, and configuring each account's .bashrc.

For example: 

  1. Create two user accounts: produser (prod) and druser (dr).
  2. Create two HDFS config directories, "/mnt/hadoopprod/conf" and "/mnt/hadoopdr/conf".
  3. Point each user at their Hadoop config directory in the ~/.bashrc file (see the sketch below).
  4. Switch user and use the cluster 🙂
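
A minimal sketch of what step 3 could look like, reusing the directories from step 2 and assuming each one holds the client configs of the corresponding cluster:

# In produser's ~/.bashrc (prod cluster client configs):
export HADOOP_CONF_DIR=/mnt/hadoopprod/conf

# In druser's ~/.bashrc (DR cluster client configs):
export HADOOP_CONF_DIR=/mnt/hadoopdr/conf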

avatar
Explorer
Hi,

You may try the workarounds below.

1) Generally, the operations team creates a client system and allows access to
the production cluster from there rather than giving access to the datanodes.
So if it is just a client, you can use the previous solution.

2) If you really want to read data from cluster 1 on cluster 2, then you can
try using the namenode IP rather than the nameservice:

hdfs dfs -ls hdfs://namenode-ip:port/
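
For example (the hostname is a placeholder; 8020 is a common namenode RPC port, and with HA you would need whichever namenode is currently active):

# Run on a cluster 1 node to see which hosts its namenodes run on:
hdfs getconf -namenodes

# Then, from cluster 2, address that namenode directly:
hdfs dfs -ls hdfs://nn1.cluster1.example.com:8020/user/xyz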
