Created on 07-07-2016 04:14 AM
Sections 1 and 2 below discuss the original configurations and the original CLI options, respectively. Sections 3 and 4 discuss the new configurations and the new CLI options.
dfs.datanode.balance.max.concurrent.moves, default is 5
This configuration limits the maximum number of concurrent block moves that a Datanode is allowed to perform for balancing the cluster. If it is set in a Datanode, the Datanode throws an exception when the limit is exceeded. If it is set in the Balancer, the Balancer schedules concurrent block movements within the limit. Note that the Datanode setting and the Balancer setting can be different; since both impose a restriction, the effective limit is the minimum of the two.
Before HDFS-9214, changing this configuration in a Datanode required restarting the Datanode. It was therefore recommended to set it to the highest possible value in the Datanodes and to adjust the runtime value in the Balancer for flexibility. HDFS-9214 allows this conf to be reconfigured without a Datanode restart. To re-configure a Datanode, first update the value in the Datanode's hdfs-site.xml, then start the reconfiguration task:
hdfs dfsadmin -reconfig datanode <dn_addr>:<ipc_port> start
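The progress of the reconfiguration task can then be checked with the companion status subcommand:
hdfs dfsadmin -reconfig datanode <dn_addr>:<ipc_port> status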
For example, suppose a Datanode has 12 disks. Then this configuration can be set to 24 in the Datanodes, a small multiple of the number of disks; a higher value is unlikely to help and only increases disk contention. If the Balancer is running in a maintenance window, the setting in the Balancer can be the same, i.e. 24, in order to utilize all the bandwidth. However, if the Balancer is running at the same time as other jobs, it should be set to a smaller value, say 5, in the Balancer so that bandwidth remains available for those jobs.
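The Balancer-side value can be set at launch time without editing any configuration files. This assumes the Balancer in your release accepts Hadoop's generic -D options (it is invoked through ToolRunner in recent releases):
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=5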
dfs.datanode.balance.bandwidthPerSec, default is 1048576 (=1MB/s)
This configuration limits the bandwidth that each Datanode can use for balancing the cluster. Unlike dfs.datanode.balance.max.concurrent.moves, changing this configuration has never required restarting Datanodes. It can be done with the following command.
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
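For example, to raise the limit on all the Datanodes to 100MB/s (the value is given in bytes per second):
hdfs dfsadmin -setBalancerBandwidth 104857600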
dfs.balancer.moverThreads, default is 1000
This is the number of threads in the Balancer for moving blocks. Each block move requires a thread, so this configuration limits the total number of concurrent moves for balancing across the entire cluster.
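If the default becomes a bottleneck on a large cluster, the value can be raised for a single run, again assuming the Balancer accepts generic -D options:
hdfs balancer -Ddfs.balancer.moverThreads=2000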
[-policy <policy>]
Describe how to determine whether a cluster is balanced. The two supported policies are:
datanode: the cluster is balanced if each datanode is balanced.
blockpool: the cluster is balanced if each block pool in each datanode is balanced.
Note that the blockpool policy is stricter than the datanode policy in the sense that the blockpool policy requirement implies the datanode policy requirement. The default policy is datanode.
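For example, on a federated cluster the stricter policy can be requested explicitly:
hdfs balancer -policy blockpool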
[-threshold <threshold>]
Specify a number in the closed interval [1.0, 100.0] representing the acceptable threshold as a percentage of storage capacity: any storage whose utilization falls outside the average +/- the threshold is considered over- or under-utilized. The default threshold is 10.0, i.e. 10%.
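For example, to balance more aggressively so that every datanode ends up within 5% of the cluster average utilization:
hdfs balancer -threshold 5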
[-blockpools <comma-separated list of blockpool ids>]
The Balancer will run on only the block pools specified in this list. When the list is empty, the Balancer runs on all the existing block pools. The default value is an empty list.
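For example, to balance only two particular block pools (the ids below are purely illustrative; actual ids follow the BP-<random>-<namenode-ip>-<timestamp> form):
hdfs balancer -blockpools BP-1073741825-10.0.0.1-1452987654321,BP-1073741826-10.0.0.2-1452987654321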
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
When the include list is nonempty, only the datanodes specified in the list will be balanced by the Balancer. An empty include list means including all the datanodes in the cluster. The default value is an empty list.
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
The datanodes specified in the exclude list will be excluded so that the Balancer will not balance those datanodes. An empty exclude list means that no datanodes are excluded. When a datanode is specified in both the include list and the exclude list, the datanode is excluded. The default value is an empty list.
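For example, the two options can be combined to balance only the datanodes listed in a hosts file while leaving one of them untouched (the file path and host name below are illustrative):
hdfs balancer -include -f /etc/hadoop/conf/balancer-hosts.txt -exclude dn5.example.com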
[-idleiterations <idleiterations>]
It specifies the number of consecutive iterations (-1 for infinite) in which no blocks have been moved before the Balancer terminates with the NO_MOVE_PROGRESS exit code. The default is 5.
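For example, to let the Balancer keep running indefinitely instead of exiting after idle iterations:
hdfs balancer -idleiterations -1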
[-runDuringUpgrade]
If it is specified, the Balancer runs even when there is an ongoing HDFS upgrade; otherwise, the Balancer terminates with the UNFINALIZED_UPGRADE exit code. When there is no ongoing upgrade, this option has no effect. It is usually undesirable to run the Balancer during an upgrade since, in order to support rollback, blocks deleted from HDFS are moved to an internal trash directory in the datanodes rather than actually deleted. Therefore, running the Balancer during an upgrade cannot reduce the usage of any datanode storage.
dfs.balancer.max-size-to-move, default is 10737418240 (=10GB)
In each iteration, the Balancer chooses datanodes in pairs and then moves data between the datanode pairs. This configuration limits the maximum size of data that the Balancer will move between a chosen datanode pair. When the network and disks are not saturated, increasing this configuration increases the data transferred between each datanode pair per iteration while the duration of an iteration remains about the same.
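For example, a maintenance-window run could double the per-pair limit to 20GB (21474836480 bytes); the -D override below assumes the Balancer accepts generic options:
hdfs balancer -Ddfs.balancer.max-size-to-move=21474836480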
dfs.balancer.getBlocks.size, default is 2147483648 (=2GB)
dfs.balancer.getBlocks.min-block-size, default is 10485760 (=10MB)
After the Balancer has decided to move a certain amount of data between two datanodes (a source and a destination), it repeatedly invokes the getBlocks(..) RPC to the Namenode in order to get lists of blocks from the source datanode until the required amount of data is scheduled. dfs.balancer.getBlocks.size is the total data size of the block list returned by a single getBlocks(..) RPC. dfs.balancer.getBlocks.min-block-size is the minimum size a block must have in order to be used for balancing the cluster.
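For example, to skip blocks smaller than 100MB so that the Balancer spends its iterations on larger blocks (again assuming generic -D options are supported):
hdfs balancer -Ddfs.balancer.getBlocks.min-block-size=104857600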
dfs.datanode.block-pinning.enabled, default is false
When creating a file, a user application may specify a list of favorable datanodes via the file creation API in DistributedFileSystem. The Namenode makes a best effort to allocate the blocks to the favorable datanodes. When dfs.datanode.block-pinning.enabled is set to true, a block replica written to a favorable datanode is "pinned" to that datanode. Pinned replicas are not moved for cluster balancing, in order to keep them stored on the specified favorable datanodes. This feature is useful for block-distribution-aware user applications such as HBase.
[-source [-f <hosts-file> | <comma-separated list of hosts>]]
The new -source option allows specifying a source datanode list so that the Balancer selects blocks to move only from those datanodes. When the list is empty, all the datanodes can be chosen as a source. The default value is an empty list.
The option can be used to free up the space of particular datanodes in the cluster. Without the -source option, the Balancer can be inefficient in some cases. Below is an example.
Datanodes (with the same capacity) | Utilization | Rack |
D1 | 95% | A |
D2 | 30% | B |
D3, D4, D5 | 0% | B |
In the table above, the average utilization is 25%, so D2's utilization is within the 10% threshold and it is unnecessary to move any blocks from or to D2. Without specifying source nodes, the Balancer first moves blocks from D2 to D3, D4 and D5, since they are in the same rack, and only then moves blocks from D1 to D3, D4 and D5. By specifying D1 as the source node, the Balancer moves blocks from D1 to D3, D4 and D5 directly.
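In this example, the Balancer could be invoked as follows (the host name standing in for D1 is illustrative):
hdfs balancer -source d1.example.com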
This is the second article of the HDFS Balancer series. We will explain the algorithm used by the Balancer for balancing the cluster in the next article.
Created on 07-08-2016 05:25 PM
Nice series @szetszwo! Thanks for writing this up. It will be a very useful resource for administrators.
Created on 06-23-2017 04:29 PM
Thanks for the very useful article.
Will there be any follow-ups coming? I am particularly interested in the changes that came about from https://issues.apache.org/jira/browse/HDFS-8818 and how settings like dfs.balancer.moverThreads need to be increased from the default when balancing a large number of unbalanced nodes (e.g. the setting used in this comment on HDFS-8818: https://issues.apache.org/jira/browse/HDFS-8818?focusedCommentId=15997429&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15997429).