Created on 01-03-2016 11:00 PM - edited 09-16-2022 01:33 AM
Hadoop can consume a great deal of network bandwidth as well as disk storage. Typical configuration choices, such as a 3x data replication factor, further increase network demand, as can otherwise desirable technologies such as network storage. For example, loading a 1 TB file with a 3x replication factor generates roughly 3 TB of network traffic to complete the load, plus another 3 TB of data movement if network drives are used.
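To see the effect in practice, the replication factor can be checked and adjusted from any client node. The following is a minimal sketch, assuming an HDP client with the hdfs CLI on the PATH; the file paths are hypothetical:

# Show the configured default replication factor
hdfs getconf -confKey dfs.replication

# Loading a 1 TB file with dfs.replication=3 moves roughly 3 TB over the network
hdfs dfs -put /data/big_file.dat /landing/big_file.dat

# Lower the replication of a less critical file to reduce storage and traffic
hdfs dfs -setrep -w 2 /landing/big_file.dat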
The sections below aggregate previously identified best practices for setting up the network to support the deployment of an HDP cluster.
HDP includes functionality to run Hadoop in a dual-homed environment. Set the two properties below to true, both across the cluster and on your client machines. Their defaults are defined in hdfs-default.xml (packaged inside the Hadoop jars); place your overrides in hdfs-site.xml.
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether clients should use DataNode hostnames when
  connecting to DataNodes.</description>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether DataNodes should use DataNode hostnames when
  connecting to other DataNodes for data transfer.</description>
</property>
NOTE: this configuration is required because a DataNode registers with the NameNode using an IP address by default. A client residing on a different network may then be unable to locate the DataNode, so the system must be reconfigured to use hostnames instead of IP addresses for name resolution. Please ensure that the hostnames resolve correctly and are reachable from both networks.
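A quick, hedged way to validate the setup from a client on each network; the hostname below is an example, and 50010 is the usual HDP 2.x default for the DataNode transfer port (dfs.datanode.address), so confirm it for your version:

# Confirm the client-side setting took effect
hdfs getconf -confKey dfs.client.use.datanode.hostname   # expect: true

# Confirm the DataNode hostname resolves and its transfer port is reachable
getent hosts dn1.example.com
nc -vz dn1.example.com 50010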
Firewalls should not be present between nodes in the cluster. Placing a firewall between cluster nodes will degrade cluster performance.
Placing a firewall around the cluster is acceptable, with the understanding that various firewall ports will need to be opened for communication with the cluster, or that Apache Knox will need to be enabled.
For a complete list of the network ports used by each service, please consult the documentation available at the link below:
Open only the firewall ports described in the documentation referenced above, and only for services actually in use on your cluster. For example, many customers do not use the Accumulo service, in which case none of the ports referenced in section 2.1 of that documentation need to be opened in your firewall.
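As an illustration only, opening the NameNode ports with firewalld on a RHEL/CentOS perimeter host might look like the sketch below; 8020 and 50070 are the common HDP 2.x defaults, so confirm them against the port documentation for your version before opening anything:

firewall-cmd --permanent --add-port=8020/tcp    # NameNode IPC
firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI (HTTP)
firewall-cmd --reload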
Gateway nodes are essentially specialized client nodes. They should have firewall access only to the services planned for use by external processes and end users, and should have no access to internal-only or administrative services. To minimize network traffic between the cluster and the gateway node, restrict access so that end users cannot transfer large quantities of data out of the HDP cluster through the gateway. For efficient network utilization, the gateway should serve only aggregated, reduced-size data subsets.
For a cluster behind a tightly defined firewall, you should also consider implementing Apache Knox to give client processes and users outside the cluster access to the data and services the cluster exposes, while security continues to be administered through Knox. More details on Apache Knox with the HDP cluster are available at:
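For a sense of how external clients interact with the cluster through Knox, here is a hedged sketch of a WebHDFS directory listing proxied through the gateway; the hostname, credentials, and the "default" topology name are placeholders for your own deployment:

# List /tmp in HDFS via the Knox gateway (default port 8443)
curl -ku myuser:mypassword \
  'https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'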
{note: the AWS service is always changing, so the recommendations detailed below may no longer be valid for your cluster}
Created on 11-14-2016 07:23 PM
Spoke to @Mark Johnson and still relevant.
Created on 11-29-2016 12:36 AM
A couple of comments:
1. The section on setting up a dual-homed network is correct, but misleading. Most people who set up dual-homed networks would expect to spread at least some of the load over the interfaces, but Hadoop code is just not network-aware in that sense. So it is *much* better to use bonding/link aggregation for network redundancy (a minimal bonding sketch follows these comments).
2. In this day and age, don't even think about using 1Gb ports. Use at least 2x 10Gb ports. Cloud providers are *today* installing 50Gb networks to their servers - 2x25Gb or 1x50Gb. You're wasting a LOT of CPU if you don't give them enough bandwidth.
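To illustrate the bonding suggestion in comment 1 above, here is a minimal sketch of an 802.3ad (LACP) bond built with iproute2; the interface names and address are placeholders, and persisting the bond through your distribution's network manager is usually preferable to ad hoc commands:

# Create an LACP bond from two 10Gb interfaces (names are examples)
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down; ip link set eth0 master bond0
ip link set eth1 down; ip link set eth1 master bond0
ip link set bond0 up
ip addr add 10.0.0.10/24 dev bond0    # placeholder address; use your own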