I have a 4-node cluster with no public IP on the nodes (they have internet access, and each has a unique internal IP).
I usually use a double SSH tunnel to reach the nodes and their services via port forwarding.
My question is how to use WebHDFS from my local machine; the problem is how to specify the IP address of the vm1.local node. From inside the cluster I use this command:
curl -i "http://vm1.local:50070/webhdfs/v1/user/root/?op=LISTSTATUS"
(or 10.10.10.1 in place of vm1.local)
My config is like this:
Public IP: 151.xx.xx.xx

VM      eth0            eth1
VM1     192.168.1.10    10.10.10.1
VM2     192.168.1.11    10.10.10.2
VM3     192.168.1.12    10.10.10.3
VM4     192.168.1.13    10.10.10.4
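For reference, the double SSH tunnel described above can be captured in an ~/.ssh/config sketch (the host aliases, usernames, and internal addresses here are assumptions based on the layout above):

```
# Sketch of ~/.ssh/config for the double hop; adjust names and users.
Host bastion
    HostName 151.xx.xx.xx
    User myuser

Host vm1
    HostName 10.10.10.1
    User root
    ProxyJump bastion
    # Forward the NameNode web/WebHDFS port to the local machine
    LocalForward 50070 localhost:50070
```

With this in place, `ssh vm1` opens the tunnel and http://localhost:50070 on the local machine reaches the NameNode.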
If you put your external IP there, "http://externalIP:50070/webhdfs/v1/user/root/?op=LISTSTATUS", you should be able to get a response.
If you are not getting a response, run the following command to check whether your Java process is listening on all interfaces.
netstat -lnp | grep 50070
You should see output like the line below.
tcp 0 0 <your ip>:50070 0.0.0.0:* LISTEN 2108/java
If it's still not working, can you post the actual error that you are getting?
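If the process turns out to be bound to a single internal address rather than 0.0.0.0, one possible fix (a sketch only; verify against your distribution's defaults before changing it) is to bind the NameNode HTTP server to all interfaces in hdfs-site.xml:

```
<!-- hdfs-site.xml: make the NameNode web/WebHDFS endpoint listen on all interfaces -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:50070</value>
</property>
```

A restart of the NameNode is needed for the change to take effect.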
The nodes don't have a public IP address. The command returned:
tcp 0 0 10.10.10.1:50070 0.0.0.0:* LISTEN 6754/java
The IP address 10.10.10.1 is visible only from within the cluster. If I use the other IP, 192.168.1.10, I get curl: (7) couldn't connect to host
even when I'm on a node itself... the only address that works is 10.10.10.1.
OK. Now I don't quite get your use case. Are you trying to get this data from outside the cluster? Are these VMs all on one server node? If you are trying to get this data from the VMs to your host, check whether you have NAT enabled. Something like a host-only network works as well, but then you are adding a new network interface for that.
@Zaher Mahdhi, first let me restate your question to make sure I understand the problem. You said that you have a 4-node cluster but only 1 node has a public IP addressable from outside the cluster, right? And when you initiate communication with a master node, you have set up port forwarding in the publicly addressable node to forward to the desired master. (tricky!) However, when you try to access a file in HDFS (actually whether by WebHDFS or regular HDFS client), the HDFS master (namenode) redirects you to the correct slave (datanode), and the IP address it gives you for the datanode is of course not addressable from outside the cluster. Right?
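To illustrate that redirect: a WebHDFS file read proceeds in two HTTP steps, and the second step targets a DataNode address directly (the file path and addresses below are made up for illustration):

```
Step 1: client asks the NameNode
  GET http://vm1.local:50070/webhdfs/v1/user/root/file.txt?op=OPEN

Step 2: NameNode redirects to the DataNode holding the block
  HTTP/1.1 307 Temporary Redirect
  Location: http://10.10.10.2:50075/webhdfs/v1/user/root/file.txt?op=OPEN&...
```

Even if the first request is tunneled successfully, the redirect in step 2 points at an address (here 10.10.10.2:50075) that is not reachable from outside the cluster.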
Definitely the most straightforward solution is to run Knox on your publicly addressable node. Knox provides secure proxying for all other services in the cluster, while hiding the internal structure of the cluster from outside users. WebHDFS will work, and as a side benefit you will no longer have to set up your port forwarding scheme. Knox is also reasonably easy to configure. The only small negative is that Knox also bottlenecks all communication through the Knox server. But that's the way your port forwarding scheme works too, right? And there's no other evident solution if you can't give public IP addresses to the other nodes. So not a negative for you.
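With Knox in place, a WebHDFS call from outside would look roughly like the sketch below. The gateway host, port, topology name, and credentials are all assumptions (Knox's sample topology is commonly named "default"); substitute your own.

```shell
# Hedged sketch of a WebHDFS call routed through a Knox gateway.
# knox-host, the topology name, and the credentials are assumptions.
KNOX="https://knox-host:8443/gateway/default"
URL="$KNOX/webhdfs/v1/user/root/?op=LISTSTATUS"
echo "$URL"
# Then run it with, e.g.:
#   curl -ik -u user:password "$URL"
```

Note that the client only ever talks to the Knox host; the internal DataNode addresses never leak out to it.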
The other, traditional but non-secure and functionally limited, solution is to require logging in to the publicly addressable node and doing all your work from there. I assume you've thought of that and don't find it satisfactory.
There is one other possible approach. I have not tried it in this context, so I'm not certain it will work, but I think it can. It is to set the parameter 'dfs.client.use.datanode.hostname' to "true" in hdfs-site.xml on all namenodes, datanodes, and clients, and then set up the publicly addressable node as an IP proxy for each of the FQDN-named datanodes, for access by your client node. See https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html#Clients_u... for the semantics of dfs.client.use.datanode.hostname. Using this parameter requires that the client be able to resolve and reach each datanode by its hostname.
All the proxying has to be done down in the network layer, but if you know enough to set this up you can skip learning to configure Knox :-) The article Parameters for Multi-Homing is not directly related, but may help you understand the intended use of dfs.client.use.datanode.hostname (which is not exactly this use) and the assumptions that go with it.
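As a sketch of the client-side piece of that approach (hostnames here are assumptions matching the layout described earlier):

```
<!-- hdfs-site.xml on namenodes, datanodes, and clients -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```

On the client machine, the datanode hostnames would then need to resolve to the proxy, e.g. via /etc/hosts entries mapping vm1.local through vm4.local to the publicly addressable node, with the proxy forwarding the datanode ports onward.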
Hope this helps. Do consider the Knox solution, for both ease of setup and maintainability.
Thanks @Matt Foley for your answer. The issue here is that none of my nodes has a public IP address; I have an intermediate machine that I SSH into, and from there I SSH again into the cluster. My cluster is not accessible directly.
Yes @Zaher Mahdhi, your intermediate machine is effectively a fifth machine in the cluster (as far as clients are concerned). In my answer, just replace "the publicly addressable node" with "the intermediate machine", and the answer still applies in all regards.