Folks often ask about the best practice for setting the replication factor, evidently wondering if the default value of 3 is supported by factual data. The cool answer is: yes, it is!
Rob Chansler, an excellent engineering manager and contributor to Hadoop at Yahoo for several years, posted the best material on this in 2011. The hard-core math is in a spreadsheet attached to the Apache Jira https://issues.apache.org/jira/browse/HDFS-2535, "A Model for Data Durability", where he uses reasonable assumptions and experience from Yahoo to calculate the probable rate of data-loss events at a single site due to node failures with replication set to 3: about 0.021 events per century. See "LosingBlocks.xlsx" under "Attachments" in the Jira ticket.
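For intuition only (this is NOT the model in LosingBlocks.xlsx, just a rough back-of-envelope sketch with made-up numbers for cluster size, block count, and simultaneous node failures), here is why three independent copies are so hard to lose: a block disappears only if every node holding one of its replicas fails before re-replication catches up.

```java
// Back-of-envelope estimate of block-loss probability -- this is NOT the model
// in LosingBlocks.xlsx, just a rough illustration. Cluster size, block count
// and the number of simultaneously failed nodes are assumed values.
public class BlockLossSketch {

    // n choose r, computed as a double (fine for the small r used here)
    static double choose(long n, int r) {
        double result = 1.0;
        for (int i = 0; i < r; i++) {
            result *= (double) (n - i) / (r - i);
        }
        return result;
    }

    public static void main(String[] args) {
        long dataNodes   = 4000L;       // assumed cluster size
        long blocks      = 60000000L;   // assumed number of blocks in the cluster
        int  replication = 3;           // HDFS default
        int  failedNodes = 3;           // nodes that die before re-replication catches up

        // With replicas on 'replication' distinct, randomly chosen nodes, a block
        // is lost only if ALL of its replica nodes are among the failed ones.
        double pBlockLost   = choose(failedNodes, replication) / choose(dataNodes, replication);
        double expectedLoss = blocks * pBlockLost;

        System.out.printf("P(a given block is lost) = %.3e%n", pBlockLost);
        System.out.printf("Expected blocks lost     = %.3e%n", expectedLoss);
    }
}
```

Even under that pessimistic "three nodes die at once" assumption, the expected number of lost blocks comes out far below one. The attached spreadsheet does this kind of reasoning far more carefully, using assumptions drawn from Yahoo's operational experience.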
We know Hadoop is used in a clustered environment: each cluster has multiple racks, and each rack has multiple DataNodes.
So to make HDFS fault-tolerant in your cluster, you need to consider the following failures:
DataNode failure
Rack failure
The chance of an entire cluster failure is fairly low, so let's not think about it. For the above cases you need to make sure that:
If one DataNode fails, you can get the same data from another DataNode
If the entire Rack fails, you can get the same data from another Rack
So that's why I think the default replication factor is set to 3: no two replicas go to the same DataNode, and at least one replica goes to a different rack, which fulfills the fault-tolerance criteria above. Hope this helps.
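To make that concrete, here is a toy sketch of the rack-aware placement idea (first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack). It is NOT Hadoop's actual BlockPlacementPolicyDefault source; the Node class and the tiny 4-node example cluster are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Toy sketch of the default rack-aware placement idea -- NOT Hadoop's actual
// BlockPlacementPolicyDefault code.
public class PlacementSketch {

    static class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "@" + rack; }
    }

    static final Random RAND = new Random();

    // Pick a random node that is (or is not) on the given rack, excluding one node.
    static Node pick(List<Node> cluster, String rack, boolean sameRack, Node exclude) {
        List<Node> candidates = new ArrayList<Node>();
        for (Node n : cluster) {
            if (n.rack.equals(rack) == sameRack && n != exclude) {
                candidates.add(n);
            }
        }
        return candidates.get(RAND.nextInt(candidates.size()));
    }

    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<Node>();

        // Replica 1: the writer's own node (cheap local write).
        targets.add(writer);

        // Replica 2: a node on a DIFFERENT rack -> survives a whole-rack failure.
        Node second = pick(cluster, writer.rack, false, writer);
        targets.add(second);

        // Replica 3: a DIFFERENT node on the SAME rack as replica 2
        // -> no two replicas share a node, yet only one rack boundary is crossed.
        Node third = pick(cluster, second.rack, true, second);
        targets.add(third);

        return targets;
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
                new Node("dn1", "rack1"), new Node("dn2", "rack1"),
                new Node("dn3", "rack2"), new Node("dn4", "rack2"));
        System.out.println(chooseTargets(cluster.get(0), cluster));
    }
}
```

The net effect is exactly the criteria above: no two replicas share a DataNode, and the three replicas span two racks, so losing a single DataNode or a single rack never takes out all copies.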
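And if the default of 3 is not what you want for a particular dataset, replication is a per-file setting. You can change it from the command line with `hdfs dfs -setrep -w <n> <path>`, set the cluster-wide default via `dfs.replication` in hdfs-site.xml, or do it programmatically. The sketch below uses the real FileSystem API; the path and the new factor of 5 are just example values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Change the replication factor of a single file through the HDFS Java API.
// The path and the new factor of 5 are just example values.
public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads hdfs-site.xml, incl. dfs.replication
        FileSystem fs = FileSystem.get(conf);
        try {
            Path file = new Path("/data/important/file.csv");

            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication: " + current);

            // Ask the NameNode to maintain 5 replicas of this file's blocks.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("Change accepted: " + accepted);
        } finally {
            fs.close();
        }
    }
}
```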