Created on 03-24-2017 01:53 PM
This past year Hadoop celebrated its 10th birthday. At Hadoop Summit we talked about how far Hadoop has come, from its origins as a Google whitepaper to its initial implementation at Yahoo to where it is today, powering more than half of Fortune 500 companies and being one of, if not the most, disruptive technologies of this century so far. In that time, we’ve seen several platform improvements: the single point of failure in the Name Node was eliminated, YARN opened up the platform for mixed workloads and the proliferation of new distributed computing frameworks beyond MapReduce, and advances in hardware have increased the speed and computing density of worker nodes.
For those of you not in the know, when you write to HDFS, Hadoop’s distributed file system, your documents and data files are broken into blocks. For each of those blocks 2 additional copies are created and all the blocks are distributed across the data nodes in the cluster. There is a scheme to how the blocks are scattered but it’s not important to this topic. This 3x data replication is designed to serve two purposes: 1) provide data redundancy in the event that there’s a hard drive or node failure, and 2) provide availability so that jobs can be placed on the same node where a block of data resides. The downside of this replication strategy is that we have to adjust our storage to compensate. Because we need to maintain a balance of overall compute to storage, we can’t simply triple the raw storage per node; we usually have to triple the number of nodes in the cluster, which drives up the cost of the cluster both in terms of hardware and support.
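To make the storage cost concrete, here is a rough sizing sketch. The function names and the example figures (100 TB of data, 12 TB of raw disk per node) are assumptions chosen for illustration, and the calculation ignores headroom for temporary files, OS overhead and growth, so treat it as back-of-the-envelope arithmetic rather than a capacity plan.

```python
import math


def raw_storage_needed_tb(data_tb: float, replication: int) -> float:
    """Raw disk needed to hold `data_tb` of user data at a given replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return data_tb * replication


def nodes_needed(data_tb: float, raw_tb_per_node: float, replication: int) -> int:
    """Worker nodes needed if each node contributes `raw_tb_per_node` of raw disk."""
    return math.ceil(raw_storage_needed_tb(data_tb, replication) / raw_tb_per_node)


# Hypothetical cluster: 100 TB of data, 12 TB of raw disk per node.
for factor in (3, 2):
    print(factor, raw_storage_needed_tb(100, factor), nodes_needed(100, 12, factor))
```

With these assumed numbers, the same data set needs 25 nodes at 3x replication and 17 nodes at 2x.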
So, with the maturation of the platform and the hardware advances of the last decade, is HDFS 3x replication still necessary? Can we safely reduce replication and still realize redundancy and availability?
Hadoop engineers early on eschewed expensive RAID solutions for redundancy, opting instead to build replication directly into the file system through block replication. Hard drives fail, and in clusters with lots of disks failures are bound to happen weekly or maybe even daily. For argument’s sake we’ll skip the scenario where an entire node is unresponsive, which is most likely due to a network issue, and just assume we’re talking about HDD failures. When a drive becomes unresponsive, the Name Node assumes the blocks on that disk are lost and kicks off processes to locate the backup copies throughout the cluster and copy them so that the replication factor is maintained. Only one backup copy is needed to execute this action, not two. With 2x replication, for a block to be “permanently lost”, the disk holding the only remaining copy would have to fail before that copy was re-replicated.
The extra backup ensures that a simultaneous failure of the drive where the first backup lives can still be recovered from, but what’s the likelihood of such an event and does it justify the costs imposed? HDD resiliency has no doubt improved in the last 10 years, reducing the frequency of drive failures, and enterprise clusters nowadays are typically networked with at least gigabit Ethernet, so the replication process is very fast. That shrinks the window in which a simultaneous disk failure would have to occur. Additionally, organizations are increasingly leveraging cloud services and backup features to snapshot data and copy it off site in case of a system-wide data center outage, so “lost blocks” aren’t necessarily truly lost.
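The size of that window can be put into rough numbers. The sketch below is a simplified model, not a measurement: it assumes disks fail independently at a constant rate, treats the annual failure rate as that rate, and takes the re-replication window and the number of disks holding the only surviving copies as inputs you would have to estimate for your own cluster. The 2% annual failure rate and one-hour window are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 8760


def p_failure_in_window(annual_failure_rate: float,
                        window_hours: float,
                        disks: int = 1) -> float:
    """Chance that at least one of `disks` drives fails during the window.

    Assumes independent drives with a constant failure rate, so the count of
    failures in the window is Poisson distributed.
    """
    rate_per_hour = annual_failure_rate / HOURS_PER_YEAR
    expected_failures = rate_per_hour * window_hours * disks
    # 1 - exp(-x), computed with expm1 to stay accurate for very small x.
    return -math.expm1(-expected_failures)


# Illustrative inputs: 2% annual failure rate, 1 hour to re-replicate.
print(p_failure_in_window(0.02, 1.0))            # one specific disk
print(p_failure_in_window(0.02, 1.0, disks=50))  # any of 50 disks holding sole copies
```

Under these assumptions the chance of losing data to a second failure grows with both the window and the number of disks that hold the last remaining copies, which is why faster re-replication matters as much as drive reliability.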
When MapReduce or other frameworks execute against a block of data, it’s preferable to do so directly on the node where the data lives. When you have more copies of the data spread across the cluster, you increase your chances of finding a node where that data lives AND has available resources for your application to run. Alternatively, by increasing the number of jobs that can execute concurrently on a node, you effectively increase availability to all the blocks hosted on that node. With the growth of cores per socket and cheap memory, typical enterprise nodes of today are at least 6x larger than early Hadoop worker nodes. This increase in node-job density more than offsets the effect of removing one block copy.
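A toy model helps show how job density trades off against replica count. It assumes every task slot on a node is busy independently with the same probability, which real schedulers do not guarantee, and the slot counts and 80% utilization below are made-up figures, so read the result as an illustration of the trend rather than a prediction.

```python
def p_local_slot_available(replicas: int,
                           slots_per_node: int,
                           slot_busy_prob: float) -> float:
    """Chance that at least one node holding a replica has a free task slot.

    Toy model: slots are busy independently with probability `slot_busy_prob`.
    """
    p_node_full = slot_busy_prob ** slots_per_node  # every slot on a node is busy
    p_all_replica_nodes_full = p_node_full ** replicas
    return 1.0 - p_all_replica_nodes_full


# Hypothetical comparison at 80% slot utilization:
early_node = p_local_slot_available(replicas=3, slots_per_node=2, slot_busy_prob=0.8)
modern_node = p_local_slot_available(replicas=2, slots_per_node=12, slot_busy_prob=0.8)
print(early_node, modern_node)
```

In this model, two replicas on nodes with six times as many slots give a higher chance of a data-local task than three replicas on the smaller nodes.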
What kind of benefits could you expect by reducing the replication factor of HDFS?
Unfortunately, modifying the replication factor is an all-or-nothing change. You can use the hdfs dfs -setrep command to adjust the replication factor of an existing file or of the files within a directory, but the setting isn’t sticky. Any new files that get created going forward will fall back to the current dfs.replication setting. JIRA HDFS-199 proposes a directory-by-directory replication factor, but the status of this feature is currently unresolved. Undoubtedly this would be ideal for /tmp or other staging/working directories, which could be optimized while still retaining the higher settings for critical HBase and managed Hive directories. If you would like to see this added, I encourage you to log into the JIRA site and up-vote this feature.
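As a small illustration of the setrep step, the helper below only builds the command line; it does not run anything. The helper name and the /tmp/staging path are hypothetical. The -w flag asks setrep to wait until the target replication is reached, which can take a long time on large directories, so it is optional here.

```python
from typing import List


def build_setrep_command(replication: int, path: str, wait: bool = False) -> List[str]:
    """Build an `hdfs dfs -setrep` command for an existing file or directory."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    command = ["hdfs", "dfs", "-setrep"]
    if wait:
        command.append("-w")  # wait for re-replication to finish
    command += [str(replication), path]
    return command


# Hypothetical staging directory; prints the command instead of executing it.
print(" ".join(build_setrep_command(2, "/tmp/staging")))
```

On a machine with the Hadoop client installed, the returned list could be passed to subprocess.run(command, check=True). Files created afterwards would still pick up the cluster-wide dfs.replication value, so this has to be re-applied as new files arrive.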
You can start by making this change in non-critical dev or test environments as a way to evaluate hardware resiliency and measure the proposed benefits. Additionally, consider dropping replication to 2 or even 1 in virtualized environments where node “local” disks are actually backed by network-attached storage that may already provide data redundancy of its own. In such cases the availability argument goes out the window, as all disk reads are non-local anyway.
In summary, I question if HDFS 3x replication is still relevant today or just a hold-over from days gone by. Advances in network speed and hardware resiliency along with backup strategies allow us to safely reduce our reliance on redundant blocks. Users can realize increased performance in their clusters and this further improves the ROI offered by Hadoop. In the next article, we’ll look at some of the self-healing features of HDFS and new developments that will help shrink storage demands in upcoming versions.