I. Overview

Hadoop can consume a great deal of network bandwidth as well as disk storage. Typical configuration choices, such as a 3x data replication factor, further increase demand on the network, as do otherwise desirable technologies such as network-attached storage. For example, loading a 1 TB file with a 3x replication factor can consume 3 TB of network traffic to complete the load, plus another 3 TB of data movement if network-attached drives are used.
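As a rough illustration of that arithmetic, the following sketch (Python, illustrative only; it ignores protocol overhead and any savings from rack-aware replica placement) estimates the network traffic generated by an ingest for a given replication factor:

# Illustrative estimate only -- not part of HDP.
def estimated_network_tb(file_tb, replication=3, network_attached_storage=False):
    """Approximate TB moved over the network to ingest file_tb of data."""
    traffic = file_tb * replication              # one copy per replica in the write pipeline
    if network_attached_storage:
        traffic += file_tb * replication         # each replica is also written over the network to NAS
    return traffic

print(estimated_network_tb(1.0))                                  # 3.0 TB for a 1 TB file at 3x replication
print(estimated_network_tb(1.0, network_attached_storage=True))   # 6.0 TB when the disks are network-attached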

The sections below aggregate previously identified best practices for setting up the network to support the deployment of an HDP cluster.

II. Network Topology

  • Machines should be on an isolated network from the rest of the data center. This means that no other applications or nodes should share network I/O with the Hadoop infrastructure. This is recommended as Hadoop is I/O intensive, and all other interference should be removed for a performant cluster.
  • Machines should have static IPs. This provides stability in the network configuration. If the network were configured with dynamic IPs, the machine’s IP address could change on reboot or when the DHCP lease expires, which would cause the Hadoop services to malfunction.
  • Reverse DNS should be set up. Reverse DNS ensures that a node’s hostname can be looked up from its IP address. Certain Hadoop functionality uses and requires reverse DNS; a quick forward/reverse lookup check is sketched after this list.
  • Dedicated “Top of Rack” (TOR) switches to Hadoop
  • Use dedicated core switching blades or switches
  • Ensure application servers are “close” to Hadoop
  • Consider Ethernet bonding for increased capacity
  • All clients and cluster nodes require network access, with the firewall ports for each of the services open so the servers can communicate with one another.
  • If deployed to a cloud environment, make certain all Hadoop cluster Master and Data nodes are in the same network zone (this is especially important when using cloud services such as AWS and Azure).
  • If deployed to a physical environment, make certain to place the cluster in its own VLAN.
  • Data nodes and client nodes should have at minimum 2 x 1 Gb Ethernet; a typical recommended network controller is 1 x 10 Gb Ethernet.
  • For the switches communicating between racks, establish the fastest Ethernet connections possible with the most capacity.
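To verify the reverse DNS point above, the sketch below (Python; the node names are hypothetical placeholders for your own hosts) performs a forward lookup followed by a reverse lookup for each node and flags any mismatch:

import socket

def check_dns(hostname):
    """Verify that forward and reverse DNS agree for a cluster node."""
    ip = socket.gethostbyname(hostname)              # forward lookup: hostname -> IP
    reverse_name, _, _ = socket.gethostbyaddr(ip)    # reverse lookup: IP -> hostname
    status = "OK" if reverse_name.split(".")[0] == hostname.split(".")[0] else "MISMATCH"
    print(f"{hostname} -> {ip} -> {reverse_name} [{status}]")

# Hypothetical node names -- replace with your own cluster hosts.
for node in ["master1.example.com", "worker1.example.com"]:
    check_dns(node)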

III. Setting Up a Dual-Homed Network

HDP includes functionality to run Hadoop in a dual-homed environment. You will need to set the two properties below to true, both on the cluster and on your client machines. The defaults for these properties are defined in hdfs-default.xml (packaged inside the Hadoop jar); override them in hdfs-site.xml.

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether clients should use DataNode hostnames when connecting to DataNodes.</description>
</property>

<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether DataNodes should use DataNode hostnames when connecting to other DataNodes for data transfer.</description>
</property>

NOTE: This configuration is required because, by default, the DataNode registers with the NameNode using an IP address. A client that resides on a different network may then be unable to locate the DataNode, so the system must be configured to use hostnames instead of IP addresses for name resolution. Please ensure that the hostnames resolve correctly and are reachable from both networks; a small check is sketched below.
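The sketch below (Python; the DataNode hostnames are hypothetical, and port 50010 is an assumption based on the common HDP 2.x default for dfs.datanode.address) is one way to confirm that each DataNode hostname resolves and is reachable. Run it once from inside the cluster and once from a client on the external network:

import socket

# Hypothetical DataNode hostnames -- replace with your own.
datanodes = ["datanode1.example.com", "datanode2.example.com"]

for host in datanodes:
    try:
        ip = socket.gethostbyname(host)                            # hostname must resolve on this network
        with socket.create_connection((host, 50010), timeout=5):   # assumed DataNode data-transfer port
            print(f"{host} resolves to {ip} and the DataNode port is reachable")
    except OSError as err:
        print(f"{host}: FAILED ({err})")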

IV. Firewall Definitions

Firewalls should not be present between nodes in the cluster. Placing a firewall between cluster nodes will impact the cluster performance.

Placing a firewall around the cluster is acceptable, with the understanding that various firewall ports will need to be opened for communication with the cluster, or that Apache Knox will need to be enabled.

For a complete list of network ports used for each of the services please consult the documentation available at the link below:

It is only necessary to open the firewall ports described in the documentation referenced above when the corresponding service is actually in use in your cluster. For example, many customers do not use the Accumulo service, and in that case none of the ports referenced in section 2.1 of that documentation need to be opened in your firewall. A port-reachability check is sketched below.
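As an illustrative check, the sketch below (Python; the hostnames are hypothetical and the port numbers are assumptions based on common HDP defaults, so confirm them against the referenced port list) verifies that the ports for the services actually in use are reachable through the firewall:

import socket

# Only list the services your cluster actually runs; hostnames and ports here are assumptions.
services_in_use = {
    "NameNode IPC": ("namenode.example.com", 8020),
    "NameNode Web UI": ("namenode.example.com", 50070),
    "HiveServer2": ("hive.example.com", 10000),
}

for service, (host, port) in services_in_use.items():
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{service}: {host}:{port} is reachable through the firewall")
    except OSError:
        print(f"{service}: {host}:{port} is blocked or not listening")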

The Gateway nodes are specialized client nodes. They should have firewall port access only for the services planned for access by external processes and end users, and should not have access to internal-only or administrative services. To minimize network traffic between the cluster and the Gateway node, restrict access so that end users cannot transfer large quantities of data from the HDP cluster through the gateway. For efficient network utilization, the gateway should be used only for aggregated, reduced-size data subsets.

For a cluster with a tightly defined firewall, consider also implementing Apache Knox to give client processes and users outside the cluster access to the data and processing available through the cluster, while still maintaining security administered through Apache Knox. More details on using Apache Knox with an HDP cluster are available at:

V. Some Networking Considerations When Running on AWS

{Note: the AWS service is always changing, so the recommendations detailed below may no longer be valid for your cluster.}

  • Most AWS networking configurations specify an access priority, so it is generally a good idea to select the highest network speed option when configuring your network. Pay close attention to network utilization, and especially burst network utilization, to be aware of any potential impact to the HDP cluster caused by AWS network capacity issues.
  • Consistently using the DNS name associated with the AWS-issued IPv4 address will both avoid costs from unnecessary data transfers and reduce network latency; a quick check is sketched after this list.
  • Make certain the network has fully started before starting the HDP cluster instance.
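As a quick sanity check for the DNS point above, the sketch below (Python; the hostname is a hypothetical AWS internal name) confirms that the name resolves to a private address inside the VPC rather than to a public one:

import ipaddress
import socket

# Hypothetical AWS internal hostname -- replace with the names your cluster actually uses.
host = "ip-10-0-1-15.ec2.internal"
ip = ipaddress.ip_address(socket.gethostbyname(host))
verdict = "private (good)" if ip.is_private else "PUBLIC -- check your DNS settings"
print(f"{host} -> {ip} ({verdict})")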

VI. Misc. Networking Considerations

  • Make certain all members of the HDP cluster have passwordless SSH configured.
  • Basic heartbeats (typically one every 3 seconds) and administrative commands generated by the Hadoop cluster are infrequent and transfer only small amounts of data, except in extremely large cluster deployments.
  • Keep in mind that NAS disks will require more network utilization than plain local disk drives resident on the data node.
  • Make certain both fully qualified hostnames and host aliases are defined and resolvable by all nodes within the cluster.
  • Ensure the network interface is consistently defined for all members of the Hadoop cluster (i.e. MTU settings should be consistent)
  • Look into setting the MTU on all interfaces in the cluster to support jumbo frames (typically MTU=9000), but only do this after making certain that all nodes and switches support this functionality. Inconsistent or undefined MTU configurations can produce serious network problems. A consistency check is sketched after this list.
  • Disable Transparent Huge Page compaction for all nodes on the cluster.
  • Make certain all of the HDP cluster’s network connections are monitored for collisions and lost packets. Have the network administration team tune the network as required to address any issues identified as part of that monitoring.
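The sketch below (Python, reading standard Linux sysfs paths, which is an assumption for your particular distribution) prints each interface’s MTU along with the Transparent Huge Page settings so the values can be compared across every node in the cluster:

import glob
from pathlib import Path

# Run on every cluster node and compare the output.
for mtu_file in sorted(glob.glob("/sys/class/net/*/mtu")):
    iface = Path(mtu_file).parent.name
    print(f"{iface}: MTU {Path(mtu_file).read_text().strip()}")

for thp_setting in ("enabled", "defrag"):
    path = Path("/sys/kernel/mm/transparent_hugepage") / thp_setting
    if path.exists():
        # Expect the bracketed value to read [never] once THP compaction is disabled.
        print(f"THP {thp_setting}: {path.read_text().strip()}")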
Comments

Spoke to @Mark Johnson and still relevant.


A couple of comments:

1. The section on setting up a dual homed network is correct, but misleading. Most people who set up dual-homed networks would expect to spread at least some of the load over the interfaces, but Hadoop code is just not network aware in that sense. So it is *much* better to use bonding/link aggregation for network redundancy.

2. In this day and age, don't even think about using 1Gb ports. Use at least 2 10Gb ports. Cloud providers are *today* installing 50Gb networks to their servers - 2x25Gb or 1x50Gb. You're wasting a LOT of CPU if you don't give them enough bandwidth.