What rules set priority of recovery from lost disks or nodes?

avatar
Rising Star

When a disk, node, or rack fails, the missing blocks are eventually noticed by the NameNode and enqueued to be replicated. What rules dictate the priority of this operation and are they controllable by the user?

1 ACCEPTED SOLUTION

avatar

Under-replicated blocks will be prioritized, queued and replicated according to the logic in UnderReplicatedBlocks.java

"Keep prioritized queues of under replicated blocks. Blocks have replication priority, with priority QUEUE_HIGHEST_PRIORITY indicating the highest priority. Having a prioritised queues allows the BlockManager to select which blocks to replicate first -it tries to give priority to data that is most at risk or considered most valuable."

This method prioritizes under-replicated blocks:

  /** Return the priority of a block
   * @param block an under-replicated block
   * @param curReplicas current number of replicas of the block
   * @param decommissionedReplicas number of decommissioned replicas of the block
   * @param expectedReplicas expected number of replicas of the block
   * @return the priority for the block, between 0 and ({@link #LEVEL}-1)
   */
  private int getPriority(Block block,
                          int curReplicas, 
                          int decommissionedReplicas,
                          int expectedReplicas) {
    assert curReplicas >= 0 : "Negative replicas!";
    if (curReplicas >= expectedReplicas) {
      // Block has enough copies, but not enough racks
      return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
    } else if (curReplicas == 0) {
      // If there are zero non-decommissioned replicas but there are
      // some decommissioned replicas, then assign them highest priority
      if (decommissionedReplicas > 0) {
        return QUEUE_HIGHEST_PRIORITY;
      }
      //all we have are corrupt blocks
      return QUEUE_WITH_CORRUPT_BLOCKS;
    } else if (curReplicas == 1) {
      // only one replica: risk of loss
      // highest priority
      return QUEUE_HIGHEST_PRIORITY;
    } else if ((curReplicas * 3) < expectedReplicas) {
      //there is less than a third as many blocks as requested;
      //this is considered very under-replicated
      return QUEUE_VERY_UNDER_REPLICATED;
    } else {
      //add to the normal queue for under replicated blocks
      return QUEUE_UNDER_REPLICATED;
    }
  }

The queues are ordered as follows, from highest to lowest priority:

  • QUEUE_HIGHEST_PRIORITY: the blocks that must be replicated first. That is, blocks with only one copy, or blocks with zero live copies but a copy on a node being decommissioned. These blocks are at risk of loss if the disk or server on which they remain fails.
  • QUEUE_VERY_UNDER_REPLICATED: blocks that are very under-replicated compared to their expected values. Currently that means the ratio of actual:expected replicas is less than 1:3. These blocks may not be at risk, but they are clearly considered "important".
  • QUEUE_UNDER_REPLICATED: blocks that are also under-replicated, and the ratio of actual:expected is good enough that they do not need to go into the QUEUE_VERY_UNDER_REPLICATED queue.
  • QUEUE_REPLICAS_BADLY_DISTRIBUTED: there are at least as many copies of a block as required, but the copies are not adequately distributed. Loss of a rack/switch could take all copies offline.
  • QUEUE_WITH_CORRUPT_BLOCKS: This is for blocks that are corrupt and for which there are (currently) no non-corrupt copies available. The policy here is to keep those corrupt blocks replicated, but give blocks that are not corrupt higher priority.
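
To make the rules above concrete, here is a minimal standalone sketch that mirrors the quoted getPriority logic so it can be run without Hadoop. The class name ReplicationPrioritySketch is made up for illustration, the Block parameter is dropped because the quoted logic never reads it, and the numeric queue values 0 to 4 are an assumption that simply follows the order of the list above.

```java
/** Standalone sketch of the quoted priority rules; this is NOT the Hadoop class itself. */
final class ReplicationPrioritySketch {
    // Assumed numbering: lower value = replicated earlier, following the list order above.
    static final int QUEUE_HIGHEST_PRIORITY = 0;
    static final int QUEUE_VERY_UNDER_REPLICATED = 1;
    static final int QUEUE_UNDER_REPLICATED = 2;
    static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

    /** Same branch order as the quoted method, minus the unused Block argument. */
    static int getPriority(int curReplicas, int decommissionedReplicas, int expectedReplicas) {
        if (curReplicas < 0) {
            throw new IllegalArgumentException("Negative replicas!");
        }
        if (curReplicas >= expectedReplicas) {
            // Enough copies, but not enough racks
            return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
        }
        if (curReplicas == 0) {
            // Only decommissioned copies left: urgent. Otherwise only corrupt copies remain.
            return decommissionedReplicas > 0 ? QUEUE_HIGHEST_PRIORITY : QUEUE_WITH_CORRUPT_BLOCKS;
        }
        if (curReplicas == 1) {
            return QUEUE_HIGHEST_PRIORITY;
        }
        if (curReplicas * 3 < expectedReplicas) {
            return QUEUE_VERY_UNDER_REPLICATED;
        }
        return QUEUE_UNDER_REPLICATED;
    }

    public static void main(String[] args) {
        // Each row: {curReplicas, decommissionedReplicas, expectedReplicas}
        int[][] cases = { {1, 0, 3}, {2, 0, 3}, {2, 0, 10}, {0, 1, 3}, {0, 0, 3}, {3, 0, 3} };
        for (int[] c : cases) {
            System.out.printf("cur=%d decommissioned=%d expected=%d -> queue %d%n",
                    c[0], c[1], c[2], getPriority(c[0], c[1], c[2]));
        }
    }
}
```

One consequence of these rules: with an expected replication of 3, a block that has lost one replica lands in QUEUE_UNDER_REPLICATED, and a block that has lost two lands in QUEUE_HIGHEST_PRIORITY. QUEUE_VERY_UNDER_REPLICATED is only reachable when the expected replication is 7 or more, since it needs at least two live replicas and fewer than a third of the expected count.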

7 REPLIES

avatar
Rising Star

That is very helpful, thanks. I'm still trying to track down a missing piece, though. How does this set of priorities interact with user jobs? This is of concern because replication makes such heavy demands on the network. Say I had a cluster that was using 50% of the network for user jobs, and it loses a disk. Will the user jobs continue to get all they want and replication the leftovers, or will replication get the resources and user jobs the leftovers? Or is there some compromise?

avatar

Good question! The balancer has a configurable limit that keeps it from using too much network bandwidth. The parameter is dfs.datanode.balance.bandwidthPerSec, set in hdfs-site.xml; its default value is 1048576 bytes per second.

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

"Specifies the maximum amount of bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second."
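
To get a feel for what a per-datanode cap of that size means in practice, here is a back-of-envelope sketch. It is only arithmetic, under the assumptions that the cap is the sole bottleneck for the traffic in question and that the data is spread evenly over the participating datanodes. The class and method names are made up for illustration, and the 1 TiB and 10-datanode figures are arbitrary example inputs.

```java
/** Back-of-envelope transfer-time estimate under a per-datanode bandwidth cap (illustrative only). */
final class BandwidthCapEstimate {
    /** Default quoted above: 1048576 bytes per second per datanode. */
    static final long DEFAULT_BYTES_PER_SEC = 1_048_576L;

    /**
     * Seconds needed to move totalBytes when `datanodes` nodes transfer in parallel,
     * each limited to bytesPerSecPerNode. Assumes an even spread and no other bottleneck.
     */
    static double secondsToTransfer(long totalBytes, int datanodes, long bytesPerSecPerNode) {
        if (totalBytes < 0 || datanodes <= 0 || bytesPerSecPerNode <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0; datanodes and rate must be > 0");
        }
        return (double) totalBytes / ((double) datanodes * (double) bytesPerSecPerNode);
    }

    public static void main(String[] args) {
        long oneTiB = 1L << 40;   // example volume to move
        int datanodes = 10;       // example cluster size
        double seconds = secondsToTransfer(oneTiB, datanodes, DEFAULT_BYTES_PER_SEC);
        System.out.printf("%.1f seconds (about %.1f hours)%n", seconds, seconds / 3600.0);
    }
}
```

With these example inputs the estimate comes to 104857.6 seconds, a little over 29 hours, which shows why the value of the cap matters when the length of the exposure window is a concern.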

avatar
Rising Star

That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. This would be more useful if expressed as a % of capacity. I wonder if it would be worth asking for an enhancement?

avatar

Makes sense, but this is going to be difficult to implement. Do you mean the % of available capacity or the % of total network capacity? I assume you mean the % of currently available capacity, which changes depending on the jobs that are running. We would need a way to predict the volume of files that are going to be transferred. The result would be an ever-changing bandwidth. Maybe it makes sense to specify a minimum and maximum bandwidth, with YARN getting priority and able to use the full capacity.

I need to think about that a bit more... 🙂
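
The minimum/maximum idea floated above could look roughly like the following. This is purely a hypothetical sketch of the proposal, not an existing HDFS or YARN feature: the class, the method, and all four parameters are invented for illustration, and it assumes current YARN usage and link capacity are somehow known.

```java
/** Hypothetical policy sketch: replication gets the leftover bandwidth, clamped to [min, max]. */
final class ReplicationBandwidthPolicySketch {
    /**
     * All values are in bytes per second. YARN traffic is served first; replication
     * receives what is left, but never less than minBandwidth and never more than maxBandwidth.
     */
    static long allowedReplicationBandwidth(long linkCapacity, long yarnUsage,
                                            long minBandwidth, long maxBandwidth) {
        if (linkCapacity < 0 || yarnUsage < 0 || minBandwidth < 0 || maxBandwidth < minBandwidth) {
            throw new IllegalArgumentException("need non-negative values and minBandwidth <= maxBandwidth");
        }
        long leftover = Math.max(0L, linkCapacity - yarnUsage);
        return Math.max(minBandwidth, Math.min(maxBandwidth, leftover));
    }

    public static void main(String[] args) {
        // Example: 1000-unit link, replication guaranteed 100 and capped at 400.
        System.out.println(allowedReplicationBandwidth(1000, 500, 100, 400)); // idle half: capped at max
        System.out.println(allowedReplicationBandwidth(1000, 800, 100, 400)); // takes the leftover
        System.out.println(allowedReplicationBandwidth(1000, 950, 100, 400)); // floor applies
    }
}
```

The floor is what bounds the recovery window: even on a saturated network, replication keeps a guaranteed minimum, at the cost of taking that much away from running jobs.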

avatar

I'd definitely open a feature enhancement, so that we can get engineering's input on that as well. Please keep me in the loop.

avatar
Rising Star

That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, from 0 to 100, would make sense, though perhaps not for the entire replication job, only to clear the most urgent queue. Some customers really will want nothing done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.