Use WebHDFS from outside the cluster

Contributor

Hi,

I have a 4-node cluster with one public IP. (The nodes have access to the internet, and each has a unique IP.)

I usually use a double SSH tunnel to access nodes and services by port forwarding.

My question is how I can use WebHDFS from my local machine. My problem is how to specify the IP address of the vm1.local node. From inside the cluster I use this command:

curl -i "http://vm1.local:50070/webhdfs/v1/user/root/?op=LISTSTATUS"

(or 10.10.10.1 in place of vm1.local)

My config is like this:

Public IP: 151.xx.xx.xx

with

VM  | eth0         | eth1
VM1 | 192.168.1.10 | 10.10.10.1
VM2 | 192.168.1.11 | 10.10.10.2
VM3 | 192.168.1.12 | 10.10.10.3
VM4 | 192.168.1.13 | 10.10.10.4

Thanks,

Re: Use WebHDFS from outside the cluster

Guru

If you put your external IP there, "http://externalIP:50070/webhdfs/v1/user/root/?op=LISTSTATUS", you should be able to get a response.

If you are not getting a response, run the following command to see whether your Java process is listening on all interfaces.

netstat -lnp | grep 50070 

You should see an output like below.

tcp 0 0 <your ip>:50070 0.0.0.0:* LISTEN 2108/java

If it's still not working, can you post the actual error that you are getting?
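
To make the netstat check above easier to read, here is a minimal sketch that classifies the local address column of one output line. The function name `listen_scope` is hypothetical, and the sketch assumes the Linux net-tools column layout, where the fourth field is the local address.

```shell
# Classify the local address of one `netstat -lnp` output line.
# Hypothetical helper; assumes field 4 is "address:port" (Linux net-tools).
listen_scope() {
    # $1 is a single line of netstat output.
    addr=$(printf '%s\n' "$1" | awk '{print $4}')
    host=${addr%:*}   # strip the trailing :port
    case "$host" in
        0.0.0.0|::|'*') echo "all-interfaces" ;;
        127.*|::1)      echo "loopback-only" ;;
        *)              echo "single-address $host" ;;
    esac
}

# Example line (the PID and program name are illustrative):
listen_scope "tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 2108/java"
```

If the result is a single address, the web UI is reachable only through that interface. Which address the namenode binds to is normally governed by `dfs.namenode.http-address` and, on multi-homed hosts, `dfs.namenode.http-bind-host` in hdfs-site.xml.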

Re: Use WebHDFS from outside the cluster

Contributor

The nodes don't have a public IP address. The command returned:

tcp 0 0 10.10.10.1:50070 0.0.0.0:* LISTEN 6754/java

The IP address 10.10.10.1 is visible only from within the cluster. If I use the other IP, 192.168.1.10, I get "curl: (7) couldn't connect to host", even if I'm on a node. The only way is to use 10.10.10.1.

Re: Use WebHDFS from outside the cluster

Guru

OK. Now I don't understand your use case. Are you trying to get this data from outside the cluster? Are these VMs all on one server node? If you are trying to get this data from the VMs to your host, check whether you have NAT enabled. Something like a host-only network works as well, but then you are adding a new network interface for that.

Re: Use WebHDFS from outside the cluster

@Zaher Mahdhi, first let me restate your question to make sure I understand the problem. You said that you have a 4-node cluster but only 1 node has a public IP addressable from outside the cluster, right? And when you initiate communication with a master node, you have set up port forwarding in the publicly addressable node to forward to the desired master. (tricky!) However, when you try to access a file in HDFS (actually whether by WebHDFS or regular HDFS client), the HDFS master (namenode) redirects you to the correct slave (datanode), and the IP address it gives you for the datanode is of course not addressable from outside the cluster. Right?
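
The redirect described above can be observed with `curl -i` on an `op=OPEN` request: the namenode answers with a 307 Temporary Redirect whose Location header names a datanode. As a rough sketch of why this breaks behind port forwarding, and of one manual workaround, the hypothetical helper below rewrites the host and port of such a Location URL to point at a locally forwarded port. The hostname `vm2.local`, the file name, the datanode port 50075, and the local endpoint are all assumptions for illustration.

```shell
# Rewrite the host:port of a WebHDFS redirect URL so that it points at a
# locally forwarded port. Hypothetical helper, not part of Hadoop.
# $1 = Location URL returned by the namenode, $2 = replacement host:port
rewrite_location() {
    printf '%s\n' "$1" | sed -E "s#^(https?://)[^/]+#\1$2#"
}

# To see the real redirect from inside the cluster (file name is illustrative):
# curl -i "http://vm1.local:50070/webhdfs/v1/user/root/data.txt?op=OPEN"

# Example with an assumed Location value:
rewrite_location \
  "http://vm2.local:50075/webhdfs/v1/user/root/data.txt?op=OPEN&offset=0" \
  "localhost:50075"
```

This only helps if a forward to that particular datanode already exists, and any datanode may be chosen for a given request, which is why a proxy that understands the redirect is the more practical route.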

Definitely the most straightforward solution is to run Knox on your publicly addressable node. Knox provides secure proxying for all other services in the cluster, while hiding the internal structure of the cluster from outside users. WebHDFS will work, and as a side benefit you will no longer have to set up your port forwarding scheme. Knox is also reasonably easy to configure. The only small negative is that Knox also bottlenecks all communication through the Knox server. But that's the way your port forwarding scheme works too, right? And there's no other evident solution if you can't give public IP addresses to the other nodes. So not a negative for you.
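
For orientation, here is a sketch of what a WebHDFS call through Knox might look like. Knox exposes WebHDFS under a gateway URL of the form `https://{gateway-host}:{gateway-port}/{gateway-path}/{topology}/webhdfs`. The values below (host `knox.example.com`, port 8443, gateway path `gateway`, topology `default`, and the credentials) are assumptions that depend on how the gateway is configured.

```shell
# Build a WebHDFS URL that goes through a Knox gateway.
# Port 8443 and the "gateway" path are assumed defaults; adjust as needed.
# $1 = gateway host, $2 = topology name, $3 = HDFS path, $4 = operation
knox_webhdfs_url() {
    printf 'https://%s:8443/gateway/%s/webhdfs/v1%s?op=%s\n' "$1" "$2" "$3" "$4"
}

url=$(knox_webhdfs_url "knox.example.com" "default" "/user/root/" "LISTSTATUS")
echo "$url"

# To run the request (hypothetical credentials; -k skips certificate
# verification and is only reasonable with a self-signed test certificate):
# curl -ik -u myuser:mypassword "$url"
```

Because the gateway rewrites datanode redirects so that they also pass through it, the client only ever needs to reach the one public address.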

The other, traditional but non-secure and functionally limited solution, is to require logging in on the publicly addressable node and doing all your work from there. I assume you've thought of that and don't find it satisfactory.

The only other possible approach is one I have not tried in this context, so I'm not certain it will work, but I think it can. This would be to set the parameter 'dfs.client.use.datanode.hostname' to "true" in hdfs-site.xml on all namenodes, datanodes, and clients, and then set up the publicly addressable node as an IP proxy for each of the FQDN-named datanodes, for access by your client node. See https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html#Clients_u... for the semantics of dfs.client.use.datanode.hostname. Using this parameter requires that:

  • the cluster nodes have real FQDNs and not just IP addresses,
  • DNS and rDNS services accessible to the cluster are set up, and all cluster nodes know their own FQDN, so that "round-trip" DNS and rDNS calls work for all cluster nodes from within the cluster.

All the proxying has to be done down in the network layer, but if you know enough to set this up you can skip learning to configure Knox :-) The article Parameters for Multi-Homing is not directly related, but may help you understand the intended use of dfs.client.use.datanode.hostname (which is not exactly this use) and the assumptions that go with it.
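
For reference, the setting mentioned above would look roughly like this in hdfs-site.xml. This is only the configuration fragment; the proxying itself is not shown, and as noted this approach is untested in this context.

```xml
<!-- hdfs-site.xml fragment: have HDFS clients connect to datanodes by
     hostname instead of by the IP address the namenode reports. -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```

With this in place, the client resolves each datanode FQDN itself, so the proxying scheme would presumably rely on those names resolving, from the client's point of view, to the proxy.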

Hope this helps. Do consider the Knox solution, for both ease of setup and maintainability.

Re: Use WebHDFS from outside the cluster

Contributor

Thanks @Matt Foley for your answer. The issue here is that none of my nodes has a public IP address. I have an intermediate machine that I SSH into, and from there I SSH again into the cluster. My cluster is not directly accessible.

Re: Use WebHDFS from outside the cluster

Guru
@Zaher Mahdhi

Even in this case, you can go with Matt's solution of putting Knox on the 'intermediate machine' that has a public IP. This is the most straightforward solution.

Re: Use WebHDFS from outside the cluster

Yes @Zaher Mahdhi, your intermediate machine is effectively a fifth machine in the cluster (as far as clients are concerned). In my answer, just replace "the publicly addressable node" with "the intermediate machine", and my answer still applies in all regards.
