Access HDFS/MR from remote HDFS cluster
Labels: HDFS
Created on 09-17-2018 02:10 AM - edited 09-16-2022 06:43 AM
Hi, I have a requirement which sounds similar to symlinking in Hadoop/HDFS.
Requirement:
There are 2 production clusters: Cluster 1 and Cluster 2
I want to read data of cluster 1 from cluster 2 without copying it.
What came to my mind is: can I use hadoop fs -ls hdfs://namespace1/user/xyz on cluster 2?
I understand that cluster 2 won't know what namespace1 is, but I thought of appending the nameservice-related properties to cluster 2's hdfs-site.xml (via the advanced configuration snippet for gateway configs), roughly along the lines of the snippet below.
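This is only a rough sketch of what I mean (hostnames are made up, and "namespace2" stands for whatever cluster 2's own nameservice is already called):

<property>
  <name>dfs.nameservices</name>
  <value>namespace2,namespace1</value>
</property>
<property>
  <name>dfs.ha.namenodes.namespace1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.namespace1.nn1</name>
  <value>nn1.cluster1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.namespace1.nn2</name>
  <value>nn2.cluster1.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.namespace1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>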
Is this possible?
Is there any other alternative? hftp? (I have never tried either.)
Thanks
Siddesh
Created 09-17-2018 05:27 AM
If we are talking about listing directories, you have to understand what happens in the background: the namenode serves the directory contents from memory. The "other" HDFS has its own namenode, so your namenode would have to contact it.
Data retrieval: your HDFS client connects to the namenode and asks for the block location(s) of the file you are trying to access, then it connects to the datanodes with the block requests. There is just one namenode involved in this scenario.
What is possible is to create an HDFS federation: there you can split the whole directory tree and assign the parts to multiple namenodes. But that is still a single-cluster scenario.
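To illustrate with a made-up hostname and the default NameNode RPC port (8020): listing a path only requires the client to reach cluster 1's namenode, while reading a file also requires reaching its datanodes.

hdfs dfs -ls hdfs://nn1.cluster1.example.com:8020/user/xyz
hdfs dfs -cat hdfs://nn1.cluster1.example.com:8020/user/xyz/part-00000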
Created 09-17-2018 07:20 AM
I have mentioned that, as I am reading data from cluster 1, I am using hdfs://nameservice1/user/abc on cluster 2.
nameservice1 refers to the namenodes of cluster 1, so what is the issue?
Thanks
Siddesh
Created 09-17-2018 07:39 AM
I was replying to the idea of a symlink.
If you just want to access data from cluster 1 on cluster 2 (or anywhere else), make sure the client's HDFS config files point to cluster 1. I think the relevant file is hdfs-site.xml; see the example below.
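One way to do this without touching cluster 2's own configs (a sketch; the directory path is made up) is to keep a copy of cluster 1's client configuration on the cluster 2 host and point the command at it:

hadoop --config /etc/hadoop/conf.cluster1 fs -ls /user/xyz

Alternatively, HADOOP_CONF_DIR can be exported to that same directory so all hadoop/hdfs commands in the shell use cluster 1's configs.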
Created on 09-17-2018 11:49 PM - edited 09-17-2018 11:51 PM
I suggest creating two Linux user accounts, one for cluster 1 and one for cluster 2, and configuring each account's ~/.bashrc.
For example:
- Create two user accounts: produser (prod) and druser (dr).
- Create two HDFS config directories, "/mnt/hadoopprod/conf" and "/mnt/hadoopdr/conf", each holding the client configs of the corresponding cluster.
- Point each user at its config directory in that user's ~/.bashrc file (see the sketch after this list).
- Switch user and use the corresponding cluster 🙂
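A minimal sketch of the two ~/.bashrc entries, assuming the directory names above and the standard HADOOP_CONF_DIR variable:

# in produser's ~/.bashrc
export HADOOP_CONF_DIR=/mnt/hadoopprod/conf

# in druser's ~/.bashrc
export HADOOP_CONF_DIR=/mnt/hadoopdr/conf

After su - produser (or druser), a plain hdfs dfs -ls /user/xyz is resolved against whichever cluster's configs live in that directory.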
Created 10-14-2018 09:01 AM
You may try the workarounds below:
1) Generally the operations team creates a client (gateway) system and allows access to the production cluster from there, rather than giving access to the datanodes. So if it is just a client, you can use the previous solution.
2) If you really want to read cluster 1's data from cluster 2, then you can try using the namenode IP rather than the nameservice:
hdfs dfs -ls hdfs://namenode-ip:port/
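For example, with a made-up host and the default RPC port:

hdfs dfs -ls hdfs://nn1.cluster1.example.com:8020/user/xyz

Note that if cluster 1 runs NameNode HA, the address has to point at the currently active NameNode; you can check which one that is from a cluster 1 node with hdfs haadmin -getServiceState nn1 (nn1 being the NameNode ID configured there).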