
Networking and edge nodes

New Contributor

Hello,

I am very new to networking and also very new to Hadoop. I have some questions about installing and running a small Hadoop cluster (2x namenode, 3x datanode) and how to connect it to the rest of our environment. We have two networks: a 'public' network (which is really not public at all; it just allows desktops, etc. to connect), and a 'cluster' network, which is where we connect our cluster computing together. We would like to put our Hadoop cluster in the 'cluster' network, but I'm unclear as to how the 'public' network connects to this.

Is it correct to think that this is where 'edge nodes' come into play? Could we have nodes (can they be virtualized?) that sit in both the 'public' and 'cluster' networks, so that members of the public network can connect to these edge nodes and access the Hadoop cluster behind the scenes?

1 ACCEPTED SOLUTION

Rising Star

If not configured properly, networking can be a real pain in Hadoop. All nodes in a Hadoop cluster need to see each other, and they require DNS (with reverse lookup) and NTP. You can choose to deploy your cluster inside the "cluster" network and use a multi-homed edge node as a bridge between the "cluster" and "public" networks. But you need to understand how you would access your cluster data (e.g. JDBC through Hive, WebHDFS for HDFS files, and so on). An edge node doesn't grant you access to the Ambari Web UI, Ambari API, etc., so if you deploy such a config you need to open specific TCP/IP ports to allow users on the "public" network to reach those services (e.g. Ambari Views). On the edge node you can deploy all the clients (HDFS, YARN, Oozie, Hive, etc.) and let users access the edge node using SSH.
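
For example, a quick sanity check of forward/reverse DNS and time sync, run on each node, might look like the sketch below (hostnames and IPs are placeholders; your environment may use chrony instead of ntpd):

```bash
# Run on every cluster node; all hostnames/IPs below are placeholders.

# Forward lookup: the FQDN must resolve to the node's cluster-network IP.
hostname -f
getent hosts master1.cluster.example.com

# Reverse lookup: the IP must resolve back to the same FQDN.
getent hosts 10.0.1.11

# Time sync: verify the node is synchronized
# (ntpd shown; on chrony-based systems use 'chronyc tracking').
ntpstat
```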

It really depends on how you want to manage access to Hadoop services and which services you need to give your end users access to. You can use the NFS Gateway or Knox on the edge node, but what are your needs?
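
To illustrate the access paths mentioned above, here is a rough sketch of what users on the "public" network might run once they reach the edge node (or a Knox gateway). All hostnames are placeholders, and the ports shown are common defaults (HiveServer2 10000, NameNode web UI 50070 on HDP 2.x, Knox 8443) that may differ in your setup:

```bash
# SSH to the edge node, which has the Hadoop clients installed (placeholder host).
ssh user@edge1.example.com

# From the edge node: HDFS and Hive access via the installed clients.
hdfs dfs -ls /user
beeline -u "jdbc:hive2://hiveserver1.cluster.example.com:10000/default" -n user

# WebHDFS directly against the NameNode (default HTTP port 50070 assumed)...
curl "http://namenode1.cluster.example.com:50070/webhdfs/v1/user?op=LISTSTATUS"

# ...or the same call proxied through Knox, so that only the gateway port needs
# to be reachable from the "public" network (topology name and port are assumptions).
curl -k -u user "https://knox1.example.com:8443/gateway/default/webhdfs/v1/user?op=LISTSTATUS"
```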

Please take a look at these links:

Hadoop TCP/IP ports:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...

Hadoop IDC and firewalls:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...

NFS Gateway:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...

Knox:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...


8 REPLIES


Explorer

Do you have multiple NIC cards on any of the machines?


@Mudit Kumar That is up to you; some clusters have multiple NICs for redundancy or higher throughput.

New Contributor

How do I set up the connection for a gate node?


@flwong What exactly do you mean by gate node? An 'edge node'? Typically in smaller clusters like the 5-node cluster you have laid out, you could leverage a master node as an 'edge node' too. Once your cluster grows, you can then separate it out onto its own physical server.

Explorer

@andrew: The reason I asked about multiple NICs is to have one interface on a public IP and another NIC with an IP belonging to the cluster network. I think Robert is asking about something similar, if I understood his question correctly.


@Mudit Kumar That is an enterprise decision. You can also have one NIC that resolves to both the internal Hadoop IP address and the public IP address.

Typically we see clients adopt some sort of dual-firewall setup, where edge nodes (and potentially some or all master nodes) have access to the DMZ or at least the corporate network. The data nodes (and remaining master nodes) sit behind another firewall and can only communicate with other data nodes, edge nodes, and master nodes.
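
A minimal sketch of that kind of setup, assuming a firewalld-based edge or master host and HDP default ports (Ambari 8080, HiveServer2 10000, Knox 8443); adjust it to whichever services you actually intend to expose to the "public" network:

```bash
# On an edge/master node facing the "public" network (firewalld assumed).
# Only the services you intentionally expose get a hole in the firewall;
# everything else stays reachable from the cluster network only.

sudo firewall-cmd --permanent --add-port=8080/tcp    # Ambari Web UI / Views (default port)
sudo firewall-cmd --permanent --add-port=10000/tcp   # HiveServer2 JDBC (default port)
sudo firewall-cmd --permanent --add-port=8443/tcp    # Knox gateway (default port)
sudo firewall-cmd --permanent --add-service=ssh      # SSH to the edge node
sudo firewall-cmd --reload
```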

Rising Star

As for multiple networks, you can multi-home the nodes so you have a public network and a cluster-traffic network. Hardware vendor reference architectures, such as Cisco's, are designed expecting multi-homing to be configured.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
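
For reference, the linked page largely boils down to binding the HDFS server endpoints to all interfaces on multi-homed nodes. A hedged sketch of checking the relevant settings from the command line (property names come from that page; the example values are illustrative, not a recommendation for your cluster):

```bash
# List the NameNode hostnames the client configs point at (works with HA too).
hdfs getconf -namenodes

# The linked guide's approach for multi-homed nodes is to bind server endpoints
# to all interfaces via the *-bind-host properties in hdfs-site.xml, for example:
#   dfs.namenode.rpc-bind-host  = 0.0.0.0
#   dfs.namenode.http-bind-host = 0.0.0.0
# Check what is currently set (prints the value, or an error if the key is unset):
hdfs getconf -confKey dfs.namenode.rpc-bind-host
```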