Selecting the Right Hardware for Your New Hadoop Cluster

by Community Manager on 06-10-2015 01:54 PM

Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)

 

The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions or hard requirements about new machine purchases, and will prefer to work with hardware with which they're already familiar. Hadoop is not the only system that benefits from efficiencies of scale. As a general suggestion, if the cluster is new or you can't accurately predict your ultimate workload, we advise that you use balanced hardware.

 

There are four types of roles in a basic Hadoop cluster: NameNode (and Standby NameNode), JobTracker, TaskTracker, and DataNode. (A node is a machine performing a particular task.) Most machines in your cluster will perform two of these roles, functioning as both a DataNode (for data storage) and a TaskTracker (for data processing).

 

Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:

  • 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-512GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)
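
To see how a spec like this translates into cluster capacity, here is a rough sizing sketch. The 3x replication factor is the HDFS default, and the 25% reserve for temporary/intermediate job output is an illustrative assumption, not a figure from this article:

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, temp_overhead=0.25):
    """Estimate usable HDFS capacity in TB for a cluster.

    Assumes the HDFS default replication factor of 3 and reserves
    a fraction of raw space for temporary/intermediate output
    (the 25% default here is an illustrative assumption).
    """
    raw = nodes * disks_per_node * disk_tb
    return raw * (1 - temp_overhead) / replication

# e.g. 20 DataNodes, each with 12 x 2TB disks
print(usable_capacity_tb(20, 12, 2))  # -> 120.0
```

In other words, a 20-node cluster built to this spec stores roughly a quarter of its raw disk space as usable, replicated data.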

The NameNode role is responsible for coordinating data storage on the cluster, and the JobTracker for coordinating data processing. (The Standby NameNode should not be co-located on the NameNode machine, and should run on hardware identical to that of the NameNode.) Cloudera recommends that customers purchase enterprise-class machines for running the NameNode and JobTracker, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.

 

The NameNode will also require RAM directly proportional to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the NameNode provides plenty of room to grow the cluster. We also recommend having HA configured on both the NameNode and JobTracker, features that have been available in the CDH4 line for some time.
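
The rule of thumb above can be sketched as a quick calculation. The 128MB block size and the example data volume are illustrative assumptions, not figures from this article:

```python
def namenode_heap_gb(data_tb, block_size_mb=128):
    """Estimate NameNode heap using the '1GB per 1 million blocks'
    rule of thumb.

    Counts logical blocks (the NameNode tracks each block once,
    plus its replica locations). The 128MB block size is an
    assumed, common HDFS setting.
    """
    blocks = (data_tb * 1024 * 1024) / block_size_mb
    return blocks / 1_000_000  # 1GB of heap per million blocks

# e.g. ~800TB of logical data stored in HDFS
print(round(namenode_heap_gb(800), 1))  # -> 6.6
```

Even at hundreds of terabytes, the estimated heap is in the single-digit gigabytes, which is why 64GB on the NameNode leaves ample headroom for growth.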

 

Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:

  • 4-6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-128GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet

If you expect your Hadoop cluster to grow beyond 20 machines, we recommend that the initial cluster be configured as if it were to span two racks, where each rack has a top-of-rack 10 GigE switch. As the cluster grows to multiple racks, you will want to add redundant core switches to connect the top-of-rack switches with 40GigE. Having two logical racks gives the operations team a better understanding of the network requirements for intra-rack and cross-rack communication.
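
The network math implied by this layout can be sketched as an oversubscription ratio: total node bandwidth inside a rack versus the uplink capacity leaving it. The node count and uplink count below are illustrative assumptions:

```python
def oversubscription(nodes_per_rack, node_gbps=10, uplink_gbps=40,
                     uplinks=1):
    """Ratio of worst-case intra-rack demand to cross-rack capacity.

    A ratio of 1.0 means the rack can drive its uplinks at full
    node bandwidth; higher values mean cross-rack traffic (e.g.
    re-replication after a rack failure) will be bottlenecked.
    """
    return (nodes_per_rack * node_gbps) / (uplink_gbps * uplinks)

# e.g. 20 nodes per rack, 10GigE to each node, one 40GigE uplink
print(oversubscription(20))  # -> 5.0
```

A 5:1 ratio is workable for many MapReduce workloads, but shuffle-heavy jobs or rack failures will expose it, which is another reason to benchmark with two logical racks from the start.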

 

With a Hadoop cluster in place, the team can start identifying workloads and prepare to benchmark those workloads to identify hardware bottlenecks. After some time benchmarking and monitoring, the team will understand how additional machines should be configured. Heterogeneous Hadoop clusters are common, especially as they grow in size and number of use cases – so starting with a set of machines that are not “ideal” for your workload will not be a waste of time. Cloudera Manager offers templates that allow different hardware profiles to be managed in groups, making it simple to manage heterogeneous clusters.

 

Below is a list of various hardware configurations for different workloads, including our original “balanced” recommendation:

  • Light Processing Configuration (1U/machine): Two hex-core CPUs, 24-64GB memory, and 8 disk drives (1TB or 2TB)
  • Balanced Compute Configuration (1U/machine): Two hex-core CPUs, 48-128GB memory, and 12-16 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins, with two motherboards and 24 drives in a single 2U cabinet.
  • Storage Heavy Configuration (2U/machine): Two hex-core CPUs, 48-96GB memory, and 16-24 disk drives (2TB-4TB). This configuration will cause high network traffic in case of multiple node/rack failures.
  • Compute Intensive Configuration (2U/machine): Two hex-core CPUs, 64-512GB memory, and 4-8 disk drives (1TB or 2TB)

(Note that Cloudera expects to adopt 2×8, 2×10, and 2×12 core configurations as they arrive.)

 

The following diagram shows how a machine should be configured according to workload:

 

[Image: hw1.png]

 

For more related information, such as why workloads matter, going beyond MapReduce, and other considerations, please see the original article from our blog:

 

How-to: Select the Right Hardware for Your New Hadoop Cluster

Comments
by msonst
10-04-2015 02:27 PM - edited 10-04-2015 02:27 PM

Hello,

 

I'm currently looking for hardware for a medium sized project with the following constraints:

HDFS: Import of around 100 x 100MB per day
HBase: The imported data resides in HBase, stored via openTSDB. One entry is around 100 bytes
Storm: CPU intensive processing of the data stored via openTSDB
Kafka: The imported data goes through Kafka first. The output is again written to Kafka

 

Installation:
3 Nodes, on each:
HDFS DataNode
HBase RegionServer
Storm Supervisor
Storm Worker
on two of them Kafka Broker
==> OS 1TB, 20x1TB, Two hex-core CPUs >2.5GHz, 256GB RAM, Bonded Gigabit Ethernet / 10Gigabit Ethernet

 

1Node:
HDFS Name
HBase Master
Storm Nimbus
ZooKeeper
==> OS 1TB, Storage 8x1TB RAID Controller, Two hex-core >2.5GHz, 128GB RAM, Bonded Gigabit Ethernet
/ 10Gigabit Ethernet, Redundant Power Supply

1Node:
Storm UI
Ambari
==> OS 1TB, Storage 2TB, Quad-core >2.5GHz, 16GB RAM, Gigabit Ethernet

Could you please advise me on whether this grouping is good or not, and which hardware to look for?

I read a couple of articles regarding this topic, but didn't really get to a conclusion. The HW mentioned above sounds unrealistic to me.

 

 

BR
Michael

Disclaimer: The information contained in this article was generated by third parties and not by Cloudera or its personnel. Cloudera cannot guarantee its accuracy or efficacy. Cloudera disclaims all warranties of any kind and users of this information assume all risk associated with it and with following the advice or directions contained herein. By visiting this page, you agree to be bound by the Terms and Conditions of Site Usage, including all disclaimers and limitations contained therein.