Created 11-12-2015 07:47 PM
When a disk, node, or rack fails, the missing blocks are eventually noticed by the NameNode and enqueued to be replicated. What rules dictate the priority of this operation and are they controllable by the user?
Created 11-13-2015 01:27 PM
Under-replicated blocks will be prioritized, queued and replicated according to the logic in UnderReplicatedBlocks.java
"Keep prioritized queues of under replicated blocks. Blocks have replication priority, with priority QUEUE_HIGHEST_PRIORITY indicating the highest priority. Having prioritised queues allows the BlockManager to select which blocks to replicate first - it tries to give priority to data that is most at risk or considered most valuable."
This method prioritizes under-replicated blocks:
/**
 * Return the priority of a block.
 * @param block an under-replicated block
 * @param curReplicas current number of replicas of the block
 * @param decommissionedReplicas number of replicas on decommissioned nodes
 * @param expectedReplicas expected number of replicas of the block
 * @return the priority for the blocks, between 0 and ({@link #LEVEL}-1)
 */
private int getPriority(Block block,
                        int curReplicas,
                        int decommissionedReplicas,
                        int expectedReplicas) {
  assert curReplicas >= 0 : "Negative replicas!";
  if (curReplicas >= expectedReplicas) {
    // Block has enough copies, but not enough racks
    return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
  } else if (curReplicas == 0) {
    // If there are zero non-decommissioned replicas but there are
    // some decommissioned replicas, then assign them highest priority
    if (decommissionedReplicas > 0) {
      return QUEUE_HIGHEST_PRIORITY;
    }
    // all we have are corrupt blocks
    return QUEUE_WITH_CORRUPT_BLOCKS;
  } else if (curReplicas == 1) {
    // only one replica - risk of loss; highest priority
    return QUEUE_HIGHEST_PRIORITY;
  } else if ((curReplicas * 3) < expectedReplicas) {
    // there are less than a third as many replicas as requested;
    // this is considered very under-replicated
    return QUEUE_VERY_UNDER_REPLICATED;
  } else {
    // add to the normal queue for under-replicated blocks
    return QUEUE_UNDER_REPLICATED;
  }
}
Queues are ordered as follows:
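To make the ordering concrete, here is a minimal standalone sketch of the same decision logic. The constant values (0 being the most urgent queue) match those defined in UnderReplicatedBlocks in the Hadoop 2.x source; the class name BlockPriority and the simplified method signature are my own for illustration.

```java
// Standalone sketch of the priority logic in UnderReplicatedBlocks.java.
// Constant values as in the Hadoop 2.x source; 0 is the most urgent queue.
public class BlockPriority {
    static final int QUEUE_HIGHEST_PRIORITY = 0;           // last replica, or only decommissioned replicas left
    static final int QUEUE_VERY_UNDER_REPLICATED = 1;      // fewer than a third of the expected replicas
    static final int QUEUE_UNDER_REPLICATED = 2;           // under-replicated, but not critically
    static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3; // enough copies, but not enough racks
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;        // no live replicas at all

    static int getPriority(int curReplicas, int decommissionedReplicas, int expectedReplicas) {
        if (curReplicas >= expectedReplicas) {
            return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
        } else if (curReplicas == 0) {
            // Only decommissioned replicas left -> most urgent; none at all -> corrupt queue.
            return decommissionedReplicas > 0 ? QUEUE_HIGHEST_PRIORITY
                                              : QUEUE_WITH_CORRUPT_BLOCKS;
        } else if (curReplicas == 1) {
            return QUEUE_HIGHEST_PRIORITY;
        } else if (curReplicas * 3 < expectedReplicas) {
            return QUEUE_VERY_UNDER_REPLICATED;
        } else {
            return QUEUE_UNDER_REPLICATED;
        }
    }

    public static void main(String[] args) {
        // A block down to its last replica is serviced first...
        System.out.println(getPriority(1, 0, 3)); // 0 (QUEUE_HIGHEST_PRIORITY)
        // ...while a merely under-replicated block waits in the normal queue.
        System.out.println(getPriority(2, 0, 3)); // 2 (QUEUE_UNDER_REPLICATED)
    }
}
```

So with the default replication factor of 3, losing two of three replicas still only lands a block in the normal under-replicated queue; it jumps to the highest-priority queue once a single replica remains.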
Created 11-16-2015 04:11 PM
That is very helpful, thanks. I'm still trying to track down a missing piece though. How does this set of priorities interact with user jobs? This is of concern because replication makes such heavy demands on the network. Say I had a cluster that was using 50% of the network for user jobs, and it loses a disk. Will the user jobs continue to get all they want, and replication the left-overs? Or will replication get the resources and user jobs the left-overs? Or is there some compromise?
Created 11-17-2015 10:56 AM
Good question! The balancer has a configurable limit that ensures it does not utilize too much network bandwidth. You'll find the parameter dfs.datanode.balance.bandwidthPerSec in hdfs-site.xml; the default value is 1048576 bytes per second (1 MB/s).
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
"Specifies the maximum amount of bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second."
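For example, to raise the per-DataNode limit to 10 MB/s (the value here is illustrative), you would add to hdfs-site.xml:

```xml
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <!-- bytes per second; 10485760 = 10 MB/s (illustrative value) -->
  <value>10485760</value>
</property>
```

The limit can also be changed at runtime, without restarting the DataNodes, via `hdfs dfsadmin -setBalancerBandwidth <bytes per second>`.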
Created 11-17-2015 03:14 PM
That's what I was looking for. Thanks. Recovery time is a concern for a bank I'm working with because there's a window of exposure to data loss while the data is under-replicated. This would be more useful if expressed as a % of capacity. I wonder if it would be worth asking for an enhancement?
Created 11-17-2015 04:36 PM
Makes sense, but this is going to be difficult to implement. Do you mean the % of available capacity or the % of total network capacity? I assume you mean the % of currently available capacity, which changes depending on the jobs that are running. We would need a way to predict the volume of files that are going to be transferred, and the result would be an ever-changing bandwidth. Maybe it makes sense to specify a minimum and maximum bandwidth, where YARN gets priority and can use the full capacity.
I need to think about that a bit more... 🙂
Created 11-17-2015 04:36 PM
I'd definitely open a feature enhancement, so that we can get engineering's input on that as well. Please keep me in the loop.
Created 11-18-2015 01:46 PM
That would be great, thanks. Please shoot me the link if you do file an enhancement. My intuition is that granting a percentage of total cluster capacity, 0 to 100, would make sense, but perhaps not for the entire replication job - only to clear the most urgent queue. Some customers really will want nothing else done until safety is restored. Banks especially have all kinds of mandates, requirements, consent decrees, etc., that produce what seem from the outside to be unreasonable demands.