This past year Hadoop celebrated its 10th birthday. At Hadoop Summit we talked about how far Hadoop has come: from its origins as a Google whitepaper, to its initial implementation at Yahoo, to where it is today, powering more than half of the Fortune 500 and standing as one of the most disruptive technologies of this century so far, if not the most. In that time we've seen the platform improve: the single point of failure in the NameNode has been eliminated, YARN has opened the platform to mixed workloads and a proliferation of distributed computing frameworks beyond MapReduce, and advances in hardware have increased the speed and computing density of worker nodes.

For those of you not in the know, when you write to HDFS, Hadoop's distributed file system, your documents and data files are broken into blocks. For each of those blocks two additional copies are created, and all the copies are distributed across the DataNodes in the cluster. There is a scheme to how the blocks are scattered, but it's not important to this topic. This 3x data replication is designed to serve two purposes: 1) provide data redundancy in the event of a hard drive or node failure, and 2) provide availability, so that jobs can be scheduled on a node where a copy of the data resides. The downside is that this replication strategy requires us to adjust our storage to compensate. Because we need to maintain a balance of overall compute to storage, we can't simply triple the raw storage per node; we usually have to triple the number of nodes in the cluster, which drives up its cost in both hardware and support.
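
If you want to see this in action, you can inspect a file's replication factor and block placement from the command line. Here's a minimal sketch, assuming a hypothetical file /data/events.csv already sitting in HDFS:

    # Show the replication factor (%r), size in bytes (%b) and name (%n) of a file
    hdfs dfs -stat "replication=%r size=%b name=%n" /data/events.csv

    # List each block of the file and the DataNodes holding its replicas
    hdfs fsck /data/events.csv -files -blocks -locations

With the default dfs.replication of 3, fsck should report three replica locations for every block.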

So, with the maturation of the platform and the hardware advances of the last decade, is HDFS 3x replication still necessary? Can we safely reduce replication and still realize redundancy and availability?

Redundancy

Early on, Hadoop's engineers eschewed expensive RAID solutions for redundancy, opting instead to build replication directly into the file system through block replication. Hard drives fail, and in clusters with lots of disks, failures are bound to happen weekly or even daily. For argument's sake we'll skip the scenario where an entire node is unresponsive, which is most likely due to a network issue, and assume we're talking about HDD failures. When a drive becomes unresponsive, the NameNode assumes the blocks on that disk are lost and kicks off processes to locate the backup copies throughout the cluster and copy them to ensure the replication factor is maintained. Only one backup copy is needed to perform this re-replication, not two. With 2x replication, for a block to be "permanently lost", the disk holding the only remaining backup copy would have to fail before that copy was re-replicated.

The extra backup ensures that a simultaneous failure of the drive holding the first backup can still be recovered from, but what's the likelihood of such an event, and does it justify the costs imposed? HDD resiliency has no doubt improved in the last 10 years, reducing the frequency of drive failures, and enterprise clusters nowadays are typically networked with at least gigabit Ethernet, so the re-replication process is very fast, shrinking the window in which a simultaneous disk failure would have to occur. Additionally, organizations are increasingly leveraging cloud services and backup features to snapshot data and copy it off site in case of a system-wide data center outage, so "lost blocks" aren't necessarily truly lost.

Availability/Locality

When MapReduce or other frameworks execute against a block of data, it's preferable to do so directly on the node where the data lives. When you have more copies of the data spread across the cluster, you increase your chances of finding a node that holds the data AND has available resources for your application to run. Alternatively, by increasing the number of tasks that can execute concurrently on a node, you effectively increase the availability of all the blocks hosted on that node. With the growth of cores per socket and cheap memory, typical enterprise nodes of today are at least 6x larger than early Hadoop worker nodes. This increase in node-task density more than offsets the effect of dropping one of the block copies.

Benefits

What kind of benefits could you expect by reducing the replication factor of HDFS?

  1. Cost – By far this is the most recognizable savings. It's reported that IoT and other data sources are doubling the amount of data generated every year. At 3x replication, every new byte of data consumes three bytes of raw storage, so a doubling of your data means six times its original, unreplicated volume in raw storage, compression notwithstanding. And again, you can't simply add more storage per node to compensate, as that would skew the ratio of compute to storage. To scale storage in a balanced manner, users end up adding worker nodes, which translates directly into hardware and support costs.
  2. Performance – Every block write that occurs in the cluster effectively creates 2x additional network traffic as a result of block replication. This write traffic isn't limited to typical client writes; it affects all writes, including application log files and intermediate job output. Decreasing replication from 3 to 2 eliminates a third of write-related network traffic. It also decreases memory pressure on the NameNode, which is responsible for assigning the location of, and keeping track of, every block replica in the cluster.

Unfortunately, modifying the replication factor is an all-or-nothing change. You can use the hdfs dfs -setrep command to adjust the replication factor of an existing file, or of the files within a directory, but the setting isn't sticky. Any new files created going forward will fall back to the current dfs.replication setting. JIRA HDFS-199 proposes a per-directory replication factor, but the feature is currently unresolved. Undoubtedly this would be ideal: /tmp and other staging/working directories could be optimized while critical HBase and managed Hive directories retained the higher setting. If you would like to see this added, I encourage you to log into the JIRA site and up-vote the feature.
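
For reference, here's roughly what that looks like today; the /tmp path is just an example:

    # Check the cluster-wide default replication factor
    hdfs getconf -confKey dfs.replication

    # Lower the replication factor of the files currently under /tmp;
    # setrep on a directory applies to the files beneath it, and -w waits for re-replication to finish
    hdfs dfs -setrep -w 2 /tmp

Remember that this only touches existing files; anything written to /tmp afterwards goes back to the dfs.replication default.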

You can start by making this change in non-critical dev or test environments as a way to evaluate hardware resiliency and measure the proposed benefits. Additionally, consider dropping replication to 2 or even 1 in virtualized environments where node "local" disks are actually backed by network-attached storage that may employ data redundancy features of its own. In such cases the availability argument goes out the window, as all disk reads are non-local anyway.
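
One low-risk way to experiment in such an environment, sketched here with hypothetical paths, is to override replication per write rather than changing the cluster default:

    # Write a test dataset at replication 2 without touching dfs.replication cluster-wide
    hdfs dfs -D dfs.replication=2 -put results.csv /dev/experiments/results.csv

    # Confirm the replication factor that was actually applied
    hdfs dfs -stat "replication=%r name=%n" /dev/experiments/results.csv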

In summary, I question whether HDFS 3x replication is still relevant today or just a holdover from days gone by. Advances in network speed and hardware resiliency, along with backup strategies, allow us to safely reduce our reliance on redundant blocks. Users can realize increased performance in their clusters, further improving the ROI offered by Hadoop. In the next article, we'll look at some of the self-healing features of HDFS and new developments that will help shrink storage demands in upcoming versions.

Comments

I think this really depends on the workload...

I'd actually consider turning up the replication, given the following condition: the data does not change frequently but is queried repeatedly. If you aren't writing to the cluster constantly and you have extra capacity, why not consider increasing the replication factor to decrease network traffic? By spreading the data wider across the cluster you increase locality, which can actually reduce traffic on the network at read time. Yes, you'd pay a higher upfront cost for writing the data, but if the workload is write once, read 1000 times, you may be better off increasing the replication factor.
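
As a rough illustration (the path here is hypothetical), raising the replica count on a hot, read-mostly dataset is a one-liner:

    # Keep extra copies of a frequently queried, rarely updated dataset to improve read locality
    hdfs dfs -setrep -w 5 /warehouse/reference/lookup_tables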

Thoughts?

I want to acknowledge that in situations where you are doing write-heavy operations, your article is on point.