Community Articles

Find and share helpful community-sourced technical articles

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Cloudera Community
- :
- Support
- :
- Community Articles
- :
- HDFS Recovery Time from Single DataNode Failure

- Article History
- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

- Article History
- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

Guru

Created on 06-22-2017 07:46 PM

The HDFS NameNode ensures that each block is sufficiently replicated. When it detects the loss of a DataNode, it instructs remaining nodes to maintain adequate replication by creating additional block replicas.

For each lost replica, the NameNode picks a (source, destination) pair where the source is an available DataNode with another replica of the block and the destination is the target for the new replica. The re-replication work can be massively parallelized in large clusters since the replica distribution is randomized.

In this article, we estimate a lower bound for the recovery time.

- The maximum IO bandwidth of each disk is 100MB/s (reads + writes). This is true for the vast majority of clusters that use spinning disks.
- The aggregate IO capacity of the cluster is limited by disk and not the network. This is not always true but helps us establish lower bounds without discussing network topologies.
- Block replicas are uniformly distributed across the cluster and disk usage is uniform. True if the HDFS balancer was run recently.

Let's assume the cluster has * n* nodes. Each each node has

The amount of data data transfer needed for recovery is twice the capacity of the lost DataNode as each replica must be read once from a source disk and written once to the target disk.

Data transfer during recovery = 2 ⋅ (Node Capacity) = (2 ⋅ p ⋅ c) TB = (2 ⋅ p ⋅ c ⋅ 1,000,000) MB

The re-replication rate is the limited by the available aggregate IO bandwidth in the cluster:

Cluster aggregate IO bandwidth = (Disk IO bandwidth) ⋅ (Number of disks) = (100 ⋅ n ⋅ p) MB/s

Thus

Minimum Recovery Time = (Data transfer during recovery) / (Cluster aggregate IO bandwidth) = (2 ⋅ p ⋅ c ⋅ 1,000,000) / (100 ⋅ n ⋅ p) = (20,000 ⋅ c/n) seconds. where: c = Mean usage of each disk in TB. n = Number of DataNodes in the cluster.

This is the absolute best case with no other load, no network bandwidth limits, and a perfectly efficient scheduler.

E.g.

In a 100 node cluster where each disk has 4TB of data, recovery from the loss of a DataNode must take at least *(20,000 ⋅ 4) / 100 = 800* seconds or approximately 13 minutes.

Clearly, the cluster size bounds the recovery time. Disk capacities being equal, a 1000 node cluster can recover 10x faster than a 100 node cluster.

The theoretical lower bound assumes that block re-replications can be instantaneously scheduled across the cluster. It also assumes that all cluster IO capacity is available for re-replication whereas in practice application reads and writes also consume IO capacity.

The NameNode schedules 2 outbound replication streams per DataNode, per heartbeat interval to throttle re-replication traffic. This throttle allows DataNodes to remain responsive to applications. The throttle can be adjusted via the configuration setting *dfs.namenode.replication.max-streams*. Let's call this * m* and the heartbeat interval

Also let's assume the mean block size in the cluster is ** b** MB. Then:

Re-replication Rate = Blocks Replicated cluster-wide per heartbeat interval = (n ⋅ m/h) Blocks/s

The total number of blocks to be re-replicated is the capacity of the lost node divided by the mean block size.

Number of Blocks Lost = (p ⋅ c) TB / b MB = (p ⋅ c ⋅ 1,000,000/b).

Thus:

Recovery Time = (Number of Blocks Lost) / (Re-replication Rate) = (p ⋅ c ⋅ 1,000,000) / (b ⋅ n ⋅ m/h) = (p ⋅ c ⋅ h ⋅ 1,000,000) / (b ⋅ n ⋅ m) seconds. where: p = Number of disks per node. c = Mean usage of each disk in TB. h = Heartbeat interval (default = 3 seconds). b = Mean block size in MB. n = Number of DataNodes in the cluster. m = dfs.namenode.replication.max-streams (default = 2)

Simplifying by plugging in the defaults for * h* and

Minimum Recovery Time (seconds) = (p ⋅ c ⋅ 1,500,000) / (b ⋅ n)

E.g. in the same cluster, assuming the mean block size is 128MB and each node has 8 disks, the practical lower bound on recovery time will be 3,750 seconds or *~1 hour*.

The recovery time can be reduced by:

- Increasing
*dfs.namenode.replication.max-streams*. However, setting this value too high can affect cluster performance. Note that increasing this value beyond 4 must be evaluated carefully and also requires changing the safeguard upper limit via*dfs.namenode.replication.max-streams-hard-limit.* - Using more nodes with smaller disks. Total cluster capacity remaining the same, a cluster with more nodes and smaller disks will recover faster.
- Avoiding predominantly small blocks.

4,403 Views

Version history

Top Kudoed Authors

User | Count |
---|---|

758 | |

379 | |

316 | |

309 | |

268 |