
Access HDFS/MR from remote HDFS cluster


New Contributor

Hi, I have a requirement that sounds similar to symlinking in Hadoop/HDFS.

Requirement:
There are two production clusters: cluster 1 and cluster 2.

I want to read data from cluster 1 on cluster 2 without copying it.

What came to my mind is: can I run hadoop fs -ls hdfs://namespace1/user/xyz on cluster 2?

I understand that cluster 2 won't know what namespace1 is - but I thought of appending the nameservice-related info to cluster 2's hdfs-site.xml (via an advanced configuration snippet - gateway configs).

Is this possible?
Is there any other alternative? hftp? (I have tried neither.)

Thanks

Siddesh

1 ACCEPTED SOLUTION

Re: Access HDFS/MR from remote HDFS cluster

Cloudera Employee
It is worth checking whether the use case is actually suited to HDFS's NFS Gateway role [1], which is designed for this kind of remote cluster access.

[1] Adding and Configuring an NFS Gateway - https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_hdfs_nfsgateway.html
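For reference, once the NFS Gateway role is running, a host outside the cluster can mount the HDFS namespace like an ordinary NFS share. A minimal sketch, assuming a gateway on a host called nfsgw.cluster1.example.com (hostname and mount point are illustrative):

  # Mount HDFS as exported by the NFS Gateway (NFSv3 only)
  sudo mkdir -p /mnt/hdfs_cluster1
  sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync nfsgw.cluster1.example.com:/ /mnt/hdfs_cluster1

  # HDFS files are then readable with ordinary POSIX tools
  ls /mnt/hdfs_cluster1/user/xyz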
REPLIES

Re: Access HDFS/MR from remote HDFS cluster

Master Collaborator
No, that is not possible at all.

If we are talking about listing directories, you have to understand what happens in the background: the namenode reads the list of contents from memory. The "other" HDFS has its own namenode, so your namenode would have to contact the other one.

Data retrieval works the same way: your HDFS client connects to the namenode and asks for the block location(s) of the file you are trying to access, then connects to the datanodes with the block request. There is just one namenode in this scenario.

What is possible is to create an HDFS federation: there you can split the whole file directory tree and assign parts of it to multiple namenodes. But that is still a single-cluster scenario.
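For completeness, a federated namespace is usually stitched together for clients with a ViewFS mount table in core-site.xml. A minimal sketch (the mount table name, paths, and hosts are illustrative):

  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./user</name>
    <value>hdfs://nn1.example.com:8020/user</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./data</name>
    <value>hdfs://nn2.example.com:8020/data</value>
  </property>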

Re: Access HDFS/MR from remote HDFS cluster

New Contributor

As I mentioned, I am reading data from cluster 1, so I am using hdfs://nameservice1/user/abc on cluster 2.

nameservice1 refers to the namenodes of cluster 1, so what is the issue?

Thanks

Siddesh

Re: Access HDFS/MR from remote HDFS cluster

Master Collaborator

I was replying to the idea of a symlink.

If you just want to access data from Cluster 1 on Cluster 2 (or anywhere else), make sure the HDFS config files for your client point to Cluster 1. I think it is hdfs-site.xml.
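For what it's worth, the asker's original idea maps to declaring cluster 1's nameservice in the client-side hdfs-site.xml on cluster 2. A hedged sketch, assuming cluster 1 runs HA namenodes and cluster 2's own nameservice is nameservice2 (all names, hosts, and ports are illustrative):

  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1,nameservice2</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.nn1</name>
    <value>nn1.cluster1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.nn2</name>
    <value>nn2.cluster1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

With something like that in place, hadoop fs -ls hdfs://nameservice1/user/xyz should resolve from cluster 2.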

Re: Access HDFS/MR from remote HDFS cluster

Contributor

I suggest you create two Linux user accounts, one for cluster 1 and one for cluster 2, and configure each account's .bashrc.

For example:

  1. Create two user accounts: produser (prod) and druser (dr).
  2. Create two HDFS config directories: "/mnt/hadoopprod/conf" and "/mnt/hadoopdr/conf".
  3. Point each user's Hadoop config directory at its cluster in ~/.bashrc (sketch below).
  4. Switch user and use the corresponding cluster :)
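A minimal sketch of the two ~/.bashrc files, using the standard HADOOP_CONF_DIR variable that the Hadoop client honours (paths follow the example above):

  # produser's ~/.bashrc: point the Hadoop client at the prod cluster's configs
  export HADOOP_CONF_DIR=/mnt/hadoopprod/conf

  # druser's ~/.bashrc: point the Hadoop client at the DR cluster's configs
  export HADOOP_CONF_DIR=/mnt/hadoopdr/conf

After switching users (su - produser), a plain hdfs dfs -ls / then talks to the matching cluster.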

Re: Access HDFS/MR from remote HDFS cluster

Contributor
Hi,

You may try the workarounds below.

1) Generally, operations teams create a client (edge) system and allow access to the production cluster from there, rather than giving access to the datanodes. If yours is just such a client, you can use the previous solution.

2) If you really want to read data from cluster 1 on cluster 2, you can try using the namenode IP rather than the nameservice:

hdfs dfs -ls hdfs://namenode-ip:port/
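For example, assuming cluster 1's active namenode is nn1.cluster1.example.com and listens on the default RPC port 8020 (hostname illustrative):

  hdfs dfs -ls hdfs://nn1.cluster1.example.com:8020/user/xyz

Note that this pins the command to a single namenode, so with HA it only works while that namenode is the active one.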
