The HDFS NameNode ensures that each block is sufficiently replicated. When it detects the loss of a DataNode, it instructs remaining nodes to maintain adequate replication by creating additional block replicas.
For each lost replica, the NameNode picks a (source, destination) pair where the source is an available DataNode with another replica of the block and the destination is the target for the new replica. The re-replication work can be massively parallelized in large clusters since the replica distribution is randomized.
In this article, we estimate a lower bound for the recovery time.
Let's assume the cluster has n nodes. Each node has p disks, and the usage of each disk is c terabytes (TB). The data usage of each node is thus (p ⋅ c) TB.
The amount of data transferred during recovery is twice the used capacity of the lost DataNode, as each lost replica must be read once from a source disk and written once to the target disk.
Data transfer during recovery = 2 ⋅ (Node Capacity) = (2 ⋅ p ⋅ c) TB = (2 ⋅ p ⋅ c ⋅ 1,000,000) MB
The re-replication rate is limited by the aggregate IO bandwidth available in the cluster. Assuming each disk can sustain roughly 100 MB/s:
Cluster aggregate IO bandwidth = (Disk IO bandwidth) ⋅ (Number of disks) = (100 ⋅ n ⋅ p) MB/s
Thus:
Minimum Recovery Time = (Data transfer during recovery) / (Cluster aggregate IO bandwidth) = (2 ⋅ p ⋅ c ⋅ 1,000,000) / (100 ⋅ n ⋅ p) = (20,000 ⋅ c/n) seconds
where:
c = Mean usage of each disk in TB.
n = Number of DataNodes in the cluster.
This is the absolute best case with no other load, no network bandwidth limits, and a perfectly efficient scheduler.
E.g. in a 100 node cluster where each disk has 4 TB of data, recovery from the loss of a DataNode will take at least (20,000 ⋅ 4) / 100 = 800 seconds, or approximately 13 minutes.
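To make the arithmetic easy to reproduce, here is a minimal Python sketch of this lower bound. The function name and the 100 MB/s per-disk bandwidth default are illustrative assumptions, not HDFS settings.

def min_recovery_time_seconds(n_nodes, disks_per_node, disk_usage_tb, disk_bw_mb_per_s=100):
    """Theoretical lower bound: all cluster disk IO is devoted to re-replication."""
    # Data to move, in MB: each lost replica is read once and written once.
    data_to_move_mb = 2 * disks_per_node * disk_usage_tb * 1_000_000
    # Aggregate IO bandwidth of the cluster's n * p disks, in MB/s.
    cluster_bw_mb_per_s = disk_bw_mb_per_s * n_nodes * disks_per_node
    return data_to_move_mb / cluster_bw_mb_per_s

# 100 nodes, 4 TB used per disk: 800 seconds (~13 minutes). Note that p cancels out.
print(min_recovery_time_seconds(n_nodes=100, disks_per_node=8, disk_usage_tb=4))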
Clearly, the cluster size bounds the recovery time. Per-disk usage being equal, a 1,000 node cluster can recover 10x faster than a 100 node cluster.
The theoretical lower bound assumes that block re-replications can be instantaneously scheduled across the cluster. It also assumes that all cluster IO capacity is available for re-replication whereas in practice application reads and writes also consume IO capacity.
The NameNode schedules 2 outbound replication streams per DataNode per heartbeat interval to throttle re-replication traffic. This throttle allows DataNodes to remain responsive to applications. The limit can be adjusted via the configuration setting dfs.namenode.replication.max-streams. Let's call this limit m and the heartbeat interval h.
Also let's assume the mean block size in the cluster is b MB. Then:
Re-replication Rate = (Blocks replicated cluster-wide per heartbeat interval) / (Heartbeat interval) = (n ⋅ m/h) Blocks/s
The total number of blocks to be re-replicated is the capacity of the lost node divided by the mean block size.
Number of Blocks Lost = (p ⋅ c) TB / b MB = (p ⋅ c ⋅ 1,000,000/b).
Thus:
Recovery Time = (Number of Blocks Lost) / (Re-replication Rate) = (p ⋅ c ⋅ 1,000,000) / (b ⋅ n ⋅ m/h) = (p ⋅ c ⋅ h ⋅ 1,000,000) / (b ⋅ n ⋅ m) seconds
where:
p = Number of disks per node.
c = Mean usage of each disk in TB.
h = Heartbeat interval (default = 3 seconds).
b = Mean block size in MB.
n = Number of DataNodes in the cluster.
m = dfs.namenode.replication.max-streams (default = 2).
Simplifying by plugging in the defaults for h and m, we get:
Minimum Recovery Time (seconds) = (p ⋅ c ⋅ 1,500,000) / (b ⋅ n)
E.g. in the same cluster, assuming the mean block size is 128 MB and each node has 8 disks, the practical lower bound on recovery time will be (8 ⋅ 4 ⋅ 1,500,000) / (128 ⋅ 100) = 3,750 seconds, or ~1 hour.
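For comparison, here is a similar sketch for the throttled case. The defaults for the heartbeat interval and stream limit mirror the values quoted above; the function name is again just illustrative.

def throttled_recovery_time_seconds(n_nodes, disks_per_node, disk_usage_tb,
                                    block_size_mb, heartbeat_s=3, max_streams=2):
    """Practical lower bound: limited by the per-DataNode replication stream throttle."""
    # Blocks that must be re-created, using the mean block size.
    blocks_lost = disks_per_node * disk_usage_tb * 1_000_000 / block_size_mb
    # Cluster-wide re-replication rate: m streams per node per heartbeat interval.
    blocks_per_second = n_nodes * max_streams / heartbeat_s
    return blocks_lost / blocks_per_second

# 100 nodes, 8 disks with 4 TB used each, 128 MB blocks: 3,750 seconds (~1 hour).
print(throttled_recovery_time_seconds(n_nodes=100, disks_per_node=8,
                                      disk_usage_tb=4, block_size_mb=128))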
The recovery time can be reduced by: