This past year Hadoop celebrated its 10th birthday. At Hadoop
Summit we talked about how far Hadoop has come from its origins as a Google
whitepaper, to its initial implementation at Yahoo, to where it is today:
powering more than half of the Fortune 500 and standing as one of the most
disruptive technologies of this century so far, if not the most. In that time,
we've seen major platform improvements: the single point of failure in the Name
Node has been eliminated, YARN has opened up the platform to mixed workloads
and a proliferation of new distributed computing frameworks beyond MapReduce,
and advances in hardware have increased the speed and computing density of
worker nodes.
For those of you not in the know, when you write to HDFS,
Hadoop's distributed file system, your documents and data files are broken into
blocks. For each of those blocks, two additional copies are created, and all the
blocks are distributed across the data nodes in the cluster. There is a scheme
to how the blocks are scattered, but it's not important to this topic. This 3x
data replication is designed to serve two purposes: 1) to provide data redundancy
in the event of a hard drive or node failure, and 2) to provide availability by
allowing jobs to be placed on the same node where a block of data resides. The
downside is that this replication strategy requires us to adjust our storage to
compensate. Because we need to maintain a balance of overall compute to storage,
we can't simply triple the raw storage per node; we usually have to triple the
number of nodes in the cluster overall, which drives up the cost of the cluster
both in terms of hardware and support.
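If you want to see this in action, the stock HDFS CLI will print a file's
replication factor directly; the path below is just a placeholder for a file in
your own cluster.

    # Print the replication factor (%r) recorded for a file
    hdfs dfs -stat "replication: %r" /data/events.csv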
So, with the maturation of the platform and hardware advances
over the last decade, is HDFS 3x replication still necessary? Can we safely
reduce replication and still realize redundancy and availability?
Early on, Hadoop engineers eschewed expensive RAID solutions
for redundancy, opting instead to build replication directly into the file
system through block replication. Hard drives fail, and in clusters with lots of
disks, failures are bound to happen weekly or maybe even daily. For argument's
sake we'll skip the scenario where an entire node is unresponsive, which is most
likely due to a network issue, and just assume we're talking about HDD failures.
When a drive becomes unresponsive, the Name Node assumes the blocks on that disk
are lost and kicks off processes to locate the backup copies throughout the
cluster and copy them to ensure the replication factor is maintained. Only one
backup copy is needed to execute this action, not two. With 2x replication, for
a block to be "permanently lost," the disk containing the only backup copy
would have to fail before that copy was re-replicated.
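You can watch this recovery happen. After pulling a drive in a test cluster,
the fsck summary will show the under-replicated count spike and then drain back
to zero as the Name Node restores the replication factor.

    # Summarize cluster block health; under-replicated blocks are being re-copied
    hdfs fsck / | grep -iE 'under-replicated|missing|corrupt'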
The extra backup ensures that the data survives a simultaneous
failure of the drive where the first backup lives, but what's the likelihood
of such an event, and does it justify the costs imposed? HDD resiliency has no
doubt improved in the last 10 years, reducing the frequency of drive failures,
and enterprise clusters nowadays are typically networked with at least gigabit
Ethernet, so re-replication completes quickly, shrinking the window in which a
simultaneous disk failure would have to occur. To put a rough number on it: with
a drive failure rate of a few percent per year and a re-replication window
measured in minutes, the odds that the one specific drive holding the last
remaining copy dies inside that window are vanishingly small for any single
incident. Additionally, organizations are increasingly leveraging cloud services
and backup features to snapshot data and copy it off site in case of a
system-wide data center outage, so "lost blocks" aren't necessarily truly lost.
When MapReduce or other frameworks execute against a block
of data, it's preferable to do so directly on the node where the data lives.
When you have more copies of the data spread across the cluster, you increase
your chances of finding a node where that data lives AND has available
resources for your application to run. Alternatively, by increasing the number
of jobs that can execute concurrently on a node, you effectively increase the
availability of all the blocks hosted on that node. With the growth of cores
per socket and cheap memory, typical enterprise nodes today are at least 6x
larger than early Hadoop worker nodes. This increase in per-node job density
more than offsets the effect of removing one block copy.
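If you want to reason about locality for a particular input, fsck can list
exactly which data nodes hold each copy of each block (again, the path is a
placeholder):

    # Show each block of the file and the data nodes hosting its replicas
    hdfs fsck /data/events.csv -files -blocks -locations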
What kind of benefits could you expect from reducing the
replication factor of HDFS?
Cost – By far this is the most recognizable
savings. It's reported that IoT and other data sources are doubling the amount
of data generated every year. With 3x replication, every new byte becomes three
bytes on disk, so that doubling translates into a 6x increase over the original
data's footprint in your cluster, compression notwithstanding. And again, you
can't simply add more storage to a node to compensate, as this will skew the
ratio of compute to storage. To scale storage in a balanced manner, users end
up having to add worker nodes, which translates directly to hardware and
support costs (there's a quick back-of-envelope sketch below).
Performance – Every block write that occurs in
the cluster effectively creates 2x additional network traffic as a result of
block replication. This write traffic isn't just limited to the typical writes
by clients; it affects all writes, including application log files and
intermediate job output as well. You eliminate 33% of write-related network
traffic by decreasing replication from 3 to 2. This also decreases memory
pressure on the Name Node service, which is responsible for assigning the
location of, and keeping track of, each and every block in the cluster.
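To make the cost point concrete, here's the back-of-envelope sketch promised
above, using made-up figures rather than numbers from any real cluster:

    # Assume 100 TB of logical data today, doubling every year
    logical_tb=100
    echo "raw at 3x today:     $((logical_tb * 3)) TB"   # 300 TB
    echo "raw at 2x today:     $((logical_tb * 2)) TB"   # 200 TB
    echo "raw at 3x next year: $((logical_tb * 6)) TB"   # 600 TB
    echo "raw at 2x next year: $((logical_tb * 4)) TB"   # 400 TB

At every point on that curve, 2x replication needs a third less raw storage,
and therefore roughly a third fewer worker nodes, than 3x.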
Unfortunately, modifying the replication factor is an
all-or-nothing change. You can use the hdfs dfs -setrep command to adjust the
replication factor of an existing file or the files within a directory, but the
setting isn't sticky: any new files created going forward will fall back to the
current dfs.replication setting. JIRA HDFS-199 proposes a directory-by-directory
replication factor, but the status of this feature is currently unresolved.
Undoubtedly this would be ideal for /tmp or other staging/working directories,
which could be optimized while still retaining the higher settings for critical
HBase and managed Hive directories. If you would like to see this added, I
encourage you to log into the JIRA site and up-vote this feature.
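As a sketch of what the workaround looks like today (directory and file names
are illustrative):

    # Recursively drop existing files under /tmp to 2 copies; -w waits for
    # the Name Node to finish adjusting replicas before returning
    hdfs dfs -setrep -w 2 /tmp

    # New files still follow dfs.replication, but a client can override it
    # per write with a generic -D option
    hdfs dfs -D dfs.replication=2 -put report.csv /tmp/report.csv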
You can start by making this change in non-critical dev or
test environments as a way to evaluate hardware resiliency and measure the
proposed benefits. Additionally, consider dropping replication to 2 or even 1
in virtualized environments where node "local" disks are actually backed by
network-attached storage that may employ data redundancy features implicitly.
In such cases the availability argument goes out the window, as all disk reads
are non-local anyway.
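A minimal sketch of that dev-cluster change, assuming you manage hdfs-site.xml
by hand rather than through a management console:

    # In hdfs-site.xml, lower the default applied to all new writes:
    #
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>2</value>
    #   </property>
    #
    # After redeploying client configs, spot-check a fresh write:
    hdfs dfs -put /etc/hosts /tmp/rep-check
    hdfs dfs -stat "replication: %r" /tmp/rep-check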
In summary, I question whether HDFS 3x replication is still relevant today or
just a hold-over from days gone by. Advances in network speed and hardware
resiliency, along with modern backup strategies, allow us to safely reduce our
reliance on redundant blocks. Users can realize lower costs and increased
performance in their clusters, which further improves the ROI offered by
Hadoop. In the next article, we'll look at
some of the self-healing features of HDFS and new developments that will help
shrink storage demands in upcoming versions.