Folks often ask about the best practice for setting the replication factor, evidently wondering if the default value of 3 is supported by factual data. The cool answer is: yes, it is!
Rob Chansler, an excellent engineering manager and contributor to Hadoop at Yahoo for several years, posted the best material on this in 2011. The hard-core math is in a spreadsheet attached to the Apache Jira https://issues.apache.org/jira/browse/HDFS-2535, "A Model for Data Durability", where he uses reasonable assumptions and experience from Yahoo to calculate the probable rate of data-loss events at a single site due to node failures with replication set to 3: about 0.021 events per century. See "LosingBlocks.xlsx" under "Attachments" in the Jira ticket.
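For intuition only (this is NOT the model in LosingBlocks.xlsx, just a rough back-of-envelope sketch with made-up numbers for cluster size, block count, and simultaneous node failures), here is why three independent copies are so hard to lose: a block disappears only if every node holding one of its replicas fails before re-replication catches up.

```java
// Back-of-envelope estimate of block-loss probability -- this is NOT the model
// in LosingBlocks.xlsx, just a rough illustration. Cluster size, block count
// and the number of simultaneously failed nodes are assumed values.
public class BlockLossSketch {

    // n choose r, computed as a double (fine for the small r used here)
    static double choose(long n, int r) {
        double result = 1.0;
        for (int i = 0; i < r; i++) {
            result *= (double) (n - i) / (r - i);
        }
        return result;
    }

    public static void main(String[] args) {
        long dataNodes   = 4000L;       // assumed cluster size
        long blocks      = 60000000L;   // assumed number of blocks in the cluster
        int  replication = 3;           // HDFS default
        int  failedNodes = 3;           // nodes that die before re-replication catches up

        // With replicas on 'replication' distinct, randomly chosen nodes, a block
        // is lost only if ALL of its replica nodes are among the failed ones.
        double pBlockLost   = choose(failedNodes, replication) / choose(dataNodes, replication);
        double expectedLoss = blocks * pBlockLost;

        System.out.printf("P(a given block is lost) = %.3e%n", pBlockLost);
        System.out.printf("Expected blocks lost     = %.3e%n", expectedLoss);
    }
}
```

Even under that pessimistic "three nodes die at once" assumption, the expected number of lost blocks comes out far below one. The attached spreadsheet does this kind of reasoning far more carefully, using assumptions drawn from Yahoo's operational experience.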
We know Hadoop is used in a clustered environment: each cluster has multiple racks, and each rack has multiple DataNodes.
So to make HDFS fault-tolerant in your cluster, you need to consider the following failures:
DataNode failure
Rack failure
The chance of an entire cluster failure is fairly low, so let's not think about it. For the above cases you need to make sure that:
If one DataNode fails, you can get the same data from another DataNode
If the entire Rack fails, you can get the same data from another Rack
So that's why I think the default replication factor is set to 3: no two replicas go to the same DataNode, and at least one replica goes to a different rack, which fulfills the fault-tolerance criteria above. Hope this helps.
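To make that concrete, here is a toy sketch of the rack-aware placement idea (first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack). It is NOT Hadoop's actual BlockPlacementPolicyDefault source; the Node class and the tiny 4-node example cluster are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Toy sketch of the default rack-aware placement idea -- NOT Hadoop's actual
// BlockPlacementPolicyDefault code.
public class PlacementSketch {

    static class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "@" + rack; }
    }

    static final Random RAND = new Random();

    // Pick a random node that is (or is not) on the given rack, excluding one node.
    static Node pick(List<Node> cluster, String rack, boolean sameRack, Node exclude) {
        List<Node> candidates = new ArrayList<Node>();
        for (Node n : cluster) {
            if (n.rack.equals(rack) == sameRack && n != exclude) {
                candidates.add(n);
            }
        }
        return candidates.get(RAND.nextInt(candidates.size()));
    }

    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<Node>();

        // Replica 1: the writer's own node (cheap local write).
        targets.add(writer);

        // Replica 2: a node on a DIFFERENT rack -> survives a whole-rack failure.
        Node second = pick(cluster, writer.rack, false, writer);
        targets.add(second);

        // Replica 3: a DIFFERENT node on the SAME rack as replica 2
        // -> no two replicas share a node, yet only one rack boundary is crossed.
        Node third = pick(cluster, second.rack, true, second);
        targets.add(third);

        return targets;
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
                new Node("dn1", "rack1"), new Node("dn2", "rack1"),
                new Node("dn3", "rack2"), new Node("dn4", "rack2"));
        System.out.println(chooseTargets(cluster.get(0), cluster));
    }
}
```

The net effect is exactly the criteria above: no two replicas share a DataNode, and the three replicas span two racks, so losing a single DataNode or a single rack never takes out all copies.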
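And if the default of 3 is not what you want for a particular dataset, replication is a per-file setting. You can change it from the command line with `hdfs dfs -setrep -w <n> <path>`, set the cluster-wide default via `dfs.replication` in hdfs-site.xml, or do it programmatically. The sketch below uses the real FileSystem API; the path and the new factor of 5 are just example values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Change the replication factor of a single file through the HDFS Java API.
// The path and the new factor of 5 are just example values.
public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads hdfs-site.xml, incl. dfs.replication
        FileSystem fs = FileSystem.get(conf);
        try {
            Path file = new Path("/data/important/file.csv");

            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication: " + current);

            // Ask the NameNode to maintain 5 replicas of this file's blocks.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("Change accepted: " + accepted);
        } finally {
            fs.close();
        }
    }
}
```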