07-08-2016 11:08 AM · 9 Kudos
The Balancer runs in iterations for balancing a cluster. An iteration consists of four steps, which we discuss in detail below.

Step 1, Storage Group Classification

The Balancer first invokes the getLiveDatanodeStorageReport() rpc to the Namenode in order to get the storage report for all the storage devices in all Datanodes. The storage report contains storage utilization information such as capacity, DFS used space and remaining space for each storage device in each Datanode. A Datanode may contain multiple storage devices, and the storage devices may have different storage types. A storage group G(i,T) is defined to be the group of all the storage devices with the same storage type T in Datanode i. For example, G(i,DISK) is the storage group of all the disk storage devices in Datanode i.

For each storage type T in each Datanode i, the Balancer computes

Storage Group Utilization (%): U(i,T) = 100% * (storage group used space) / (storage group capacity)
Average Utilization (%): U(avg,T) = 100% * (sum of all used spaces) / (sum of all capacities)

Let Δ be the threshold parameter (default is 10%) and G(i,T) be the storage group with storage type T in Datanode i. The utilization classes are then defined as follows.

Over-Utilized: { G(i,T) : U(avg,T) + Δ < U(i,T) }
Above-Average: { G(i,T) : U(avg,T) < U(i,T) <= U(avg,T) + Δ }
Below-Average: { G(i,T) : U(avg,T) - Δ <= U(i,T) <= U(avg,T) }
Under-Utilized: { G(i,T) : U(i,T) < U(avg,T) - Δ }

Roughly speaking, a storage group is over-utilized (under-utilized) if its utilization is larger (smaller) than the average plus (minus) the threshold. A storage group is above-average (below-average) if its utilization is larger (smaller) than the average but within the threshold.

If there are no over-utilized and no under-utilized storage groups, the cluster is said to be balanced, and the Balancer terminates with the SUCCESS state. Otherwise, it continues with the following steps.

Step 2, Storage Group Pairing

The Balancer selects over-utilized or above-average storage groups as sources, and under-utilized or below-average storage groups as targets. It pairs a source storage group with a target storage group (source -> target) in the following priority order.
Same-Rack (the source and the target must reside in the same rack):
Over-Utilized -> Under-Utilized
Over-Utilized -> Below-Average
Above-Average -> Under-Utilized

Any (regardless of rack):
Over-Utilized -> Under-Utilized
Over-Utilized -> Below-Average
Above-Average -> Under-Utilized

Step 3, Block Move Scheduling

For each source-target pair, the Balancer chooses block replicas from the source storage group and then schedules block moves. A block replica in a source storage group is a good candidate if it satisfies all the conditions below.
Its storage type is the same as the target storage type.
It is not already scheduled.
The target does not already have the same block replica.
The number of racks of the block is not reduced after the move.

Note that, logically, the Balancer schedules a block replica to be "moved" from a source storage group to a target storage group. In practice, since a block usually has multiple replicas, the block move can be done by first copying the replica from a proxy, which can be any storage group containing one of the replicas of the block, to the target storage group, and then deleting the replica from the source storage group. After a candidate block in the source datanode is chosen, the Balancer selects a storage group containing the same replica as the proxy, choosing the closest one in order to minimize the network traffic. When it is impossible to schedule any move, the Balancer terminates with the NO_MOVE_BLOCK state.

Step 4, Block Move Execution

The Balancer dispatches a scheduled block move by invoking the DataTransferProtocol.replaceBlock(..) method on the target datanode, specifying the proxy and the source as the delete-hint in the method call. The target datanode then copies the replica directly from the proxy to its local storage. Once the copy has completed, the target datanode reports the new replica to the Namenode with the delete-hint, and the Namenode uses the delete-hint to delete the extra replica, i.e. the replica stored in the source.

After all block moves are dispatched, the Balancer waits until all the moves are completed. Then, it continues with a new iteration and repeats all these steps. If all the scheduled moves fail for 5 consecutive iterations, the Balancer terminates with the NO_MOVE_PROGRESS state.

We end this article by listing the exit states in the following table.

Exit States

State                Code  Description
SUCCESS               0    The cluster is balanced (i.e. no over/under-utilized storage groups) with respect to the given threshold.
ALREADY_RUNNING      -1    Another Balancer is running.
NO_MOVE_BLOCK        -2    The Balancer is not able to schedule any move.
NO_MOVE_PROGRESS     -3    All the scheduled moves failed for 5 consecutive iterations.
IO_EXCEPTION         -4    An IOException was thrown.
ILLEGAL_ARGUMENTS    -5    An illegal argument was found in the command or in the configuration.
INTERRUPTED          -6    The Balancer process was interrupted.
UNFINALIZED_UPGRADE  -7    The cluster is being upgraded.
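The per-iteration logic of Steps 1 to 3 can be sketched in a few functions. This is a simplified illustration, not the Balancer's actual code: the names are invented, the sketch handles a single storage type, ignores the same-rack-first pairing preference, and pairs each source with at most one target.

```python
def classify(groups, threshold=10.0):
    """Step 1: split storage groups of one storage type into the four
    utilization classes. `groups` maps group name -> (used, capacity)."""
    avg = 100.0 * sum(u for u, _ in groups.values()) / sum(c for _, c in groups.values())
    over, above, below, under = [], [], [], []
    for name, (used, capacity) in groups.items():
        u = 100.0 * used / capacity
        if u > avg + threshold:
            over.append(name)
        elif u > avg:
            above.append(name)
        elif u >= avg - threshold:
            below.append(name)
        else:
            under.append(name)
    return over, above, below, under


def pair(over, above, below, under):
    """Step 2: pair sources with targets in priority order
    (over -> under, then over -> below-average, then above-average -> under).
    The real Balancer applies this order same-rack first, then any rack;
    this sketch ignores rack locality."""
    pairs = []
    for sources, targets in [(over, under), (over, below), (above, under)]:
        for source in list(sources):
            if targets:
                pairs.append((source, targets.pop(0)))
                sources.remove(source)
    return pairs


def is_good_candidate(block, source, target, scheduled, rack_of):
    """Step 3: the four conditions a replica must satisfy to be moved
    from `source` to `target`."""
    if source["storage_type"] != target["storage_type"]:
        return False  # 1. storage types must match
    if block["id"] in scheduled:
        return False  # 2. not already scheduled
    if target["datanode"] in block["locations"]:
        return False  # 3. target must not already hold a replica
    # 4. the move must not reduce the number of racks holding the block
    racks_before = {rack_of[d] for d in block["locations"]}
    locations_after = (block["locations"] - {source["datanode"]}) | {target["datanode"]}
    return len({rack_of[d] for d in locations_after}) >= len(racks_before)
```

If classify() returns empty over and under lists, the cluster is balanced and the Balancer exits with SUCCESS; otherwise the pairs drive the candidate selection and the block moves of Step 4.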
07-07-2016 04:14 AM · 14 Kudos
The original configurations and CLI options are discussed in Sections 1 and 2. The new configurations and CLI options are discussed in Sections 3 and 4.

1. Original Configurations

Concurrent Moves

dfs.datanode.balance.max.concurrent.moves, default is 5

This configuration limits the maximum number of concurrent block moves that a Datanode is allowed to perform for balancing the cluster. If it is set in a Datanode, the Datanode will throw an exception if the limit is exceeded. If it is set in the Balancer, the Balancer will schedule concurrent block movements within the limit. Note that the Datanode setting and the Balancer setting could be different. Since both settings impose a restriction, the effective setting is the minimum of them.

Before HDFS-9214, changing this configuration in a Datanode required restarting the Datanode. It was therefore recommended to set it to the highest possible value in the Datanodes and adjust the runtime value in the Balancer in order to gain flexibility. HDFS-9214 allows this configuration to be changed without a Datanode restart; below are the steps for reconfiguring a Datanode.
1. Change the value of dfs.datanode.balance.max.concurrent.moves in the configuration xml file stored on the Datanode machine.
2. Start a reconfiguration task with the following command:
hdfs dfsadmin -reconfig datanode <dn_addr>:<ipc_port> start

For example, suppose a Datanode has 12 disks. Then, this configuration can be set to 24, a small multiple of the number of disks, in the Datanodes. A higher value may not be useful and only increases disk contention. If the Balancer runs in a maintenance window, the setting in the Balancer can be the same, i.e. 24, in order to use all the available bandwidth. However, if the Balancer runs at the same time as other jobs, it should be set to a smaller value, say 5, in the Balancer so that bandwidth remains available for those jobs.

Bandwidth

dfs.datanode.balance.bandwidthPerSec, default is 1048576 (= 1 MB/s)

This configuration limits the bandwidth each Datanode uses for balancing the cluster. Unlike dfs.datanode.balance.max.concurrent.moves, changing it does not require restarting the Datanodes. It can be done with the following command:
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>

Mover Threads

dfs.balancer.moverThreads, default is 1000

This is the number of threads in the Balancer for moving blocks. Each block move requires a thread, so this configuration limits the total number of concurrent moves for balancing in the entire cluster.

2. Original CLI Options

Balancing Policy, Threshold & Block Pools

[-policy <policy>]
Describes how to determine whether a cluster is balanced. The two supported policies are:
blockpool: the cluster is balanced if each pool in each node is balanced.
datanode: the cluster is balanced if each datanode is balanced.
Note that the blockpool policy is stricter than the datanode policy in the sense that the blockpool policy requirement implies the datanode policy requirement. The default policy is datanode.
[-threshold <threshold>]
Specifies a number in the closed interval [1.0, 100.0] representing the acceptable threshold, as a percentage of storage capacity, so that storage utilization outside the average +/- the threshold is considered over/under-utilized. The default threshold is 10.0, i.e. 10%.

[-blockpools <comma-separated list of blockpool ids>]
The Balancer will run only on the block pools specified in this list. When the list is empty, the Balancer runs on all the existing block pools. The default value is an empty list.

Include & Exclude Lists

[-include [-f <hosts-file> | <comma-separated list of hosts>]]
When the include list is nonempty, only the datanodes specified in the list are balanced by the Balancer. An empty include list means including all the datanodes in the cluster. The default value is an empty list.

[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
The datanodes specified in the exclude list are excluded, so the Balancer will not balance them. An empty exclude list means that no datanodes are excluded. When a datanode is specified in both the include list and the exclude list, it is excluded. The default value is an empty list.

Idle-Iterations & Run-During-Upgrade

[-idleiterations <idleiterations>]
Specifies the number of consecutive iterations (-1 for infinite) in which no blocks have been moved before the Balancer terminates with the NO_MOVE_PROGRESS exit code. The default is 5.

[-runDuringUpgrade]
If specified, the Balancer will run even if there is an ongoing HDFS upgrade; otherwise, it terminates with the UNFINALIZED_UPGRADE exit code. When there is no ongoing upgrade, this option has no effect. It is usually undesirable to run the Balancer during an upgrade since, in order to support rollback, blocks deleted from HDFS are moved to an internal trash directory in the datanodes rather than actually deleted.
Therefore, running the Balancer during an upgrade cannot reduce the storage usage of any datanode.

3. New Configurations

HDFS-8818: Allow Balancer to run faster

dfs.balancer.max-size-to-move, default is 10737418240 (= 10 GB)

In each iteration, the Balancer chooses datanodes in pairs and then moves data between the datanode pairs. This configuration limits the maximum size of data that the Balancer will move between a chosen datanode pair. When the network and disks are not saturated, increasing it can increase the data transferred between each datanode pair per iteration, while the duration of an iteration remains about the same.

HDFS-8824: Do not use small blocks for balancing the cluster

dfs.balancer.getBlocks.size, default is 2147483648 (= 2 GB)
dfs.balancer.getBlocks.min-block-size, default is 10485760 (= 10 MB)

After the Balancer has decided to move a certain amount of data between two datanodes (source and destination), it repeatedly invokes the getBlocks(..) rpc to the Namenode in order to get lists of blocks from the source datanode until the required amount of data is scheduled. dfs.balancer.getBlocks.size is the total data size of the block list returned by a single getBlocks(..) rpc. dfs.balancer.getBlocks.min-block-size is the minimum size a block must have in order to be used for balancing the cluster.

HDFS-6133: Block Pinning

dfs.datanode.block-pinning.enabled, default is false

When creating a file, a user application may specify a list of favorable datanodes via the file creation API in DistributedFileSystem. The Namenode makes a best effort to allocate blocks to the favorable datanodes. When dfs.datanode.block-pinning.enabled is set to true, a block replica written to a favorable datanode is "pinned" to that datanode. Pinned replicas are not moved for cluster balancing, in order to keep them stored on the specified favorable datanodes. This feature is useful for block-distribution-aware user applications such as HBase.

4. New CLI Options

HDFS-8826: Source Datanodes

[-source [-f <hosts-file> | <comma-separated list of hosts>]]

The new -source option allows specifying a source datanode list so that the Balancer selects blocks to move only from those datanodes. When the list is empty, all the datanodes can be chosen as sources. The default value is an empty list. The option can be used to free up the space of particular datanodes in the cluster. Without the -source option, the Balancer can be inefficient in some cases. Below is an example.
Datanodes (with the same capacity)   Utilization   Rack
D1                                   95%           A
D2                                   30%           B
D3, D4, D5                           0%            B
In the table above, the average utilization is 25%, so D2 is within the 10% threshold and it is unnecessary to move any blocks from or to D2. Without specifying the source nodes, the Balancer first moves blocks from D2 to D3, D4 and D5, since they are in the same rack, and then moves blocks from D1 to D3, D4 and D5. By specifying D1 as the source node, the Balancer moves blocks directly from D1 to D3, D4 and D5.

This is the second article of the HDFS Balancer series. We will explain the algorithm used by the Balancer for balancing the cluster in the next article.
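The arithmetic behind this example can be checked in a few lines (illustrative only; since the datanodes have equal capacities, the capacity-weighted average reduces to the plain mean):

```python
utilization = {"D1": 95, "D2": 30, "D3": 0, "D4": 0, "D5": 0}
threshold = 10.0

# With equal capacities, the average utilization is the plain mean.
avg = sum(utilization.values()) / len(utilization)  # (95 + 30 + 0 + 0 + 0) / 5 = 25.0

# Nodes outside the [avg - threshold, avg + threshold] band need balancing.
imbalanced = sorted(d for d, u in utilization.items()
                    if u > avg + threshold or u < avg - threshold)
# D1 (95% > 35%) is over-utilized; D3, D4, D5 (0% < 15%) are under-utilized.
# D2 (30%) lies inside the [15%, 35%] band, so no blocks need to move to or from it.
```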
07-06-2016 01:12 PM · 10 Kudos
The HDFS Balancer is a tool for balancing the data across the storage devices of an HDFS cluster. The Balancer was originally designed to run slowly so that the balancing activities do not affect normal cluster activities and running jobs. We have received feedback from the HDFS community that it is also desirable for the Balancer to be configurable to run faster. The use cases are listed below.

Free up space on some nearly full datanodes.
Move data to newly added datanodes in order to utilize the new machines.
Run the Balancer when the cluster load is low or in a maintenance window, instead of running it as a background daemon.

We have changed the Balancer to address these new use cases. After the changes, the Balancer is able to run 100x faster, while it still can be configured to run slowly as before. In one of our tests, we were able to improve the performance from a few gigabytes per minute to a terabyte per minute. In addition, we added two new features: source datanodes and block pinning. Users can specify the source datanodes so that they can free up space on particular datanodes using the Balancer. A block-distribution-aware user application can pin its block replicas to particular datanodes so that the pinned replicas are not moved for cluster balancing.

Why is the data stored in HDFS imbalanced? There are three major reasons.

1. Adding Datanodes

When new datanodes are added to a cluster, newly created blocks are written to them from time to time. However, existing blocks are not moved to them without using the Balancer.

2. Client Behavior

In some cases, a client application may not write data uniformly across the datanode machines: it may always write to some particular machines but not the others. HBase is an example of such an application. In other cases, the client application is not skewed by design, such as MapReduce/YARN jobs; however, the data is skewed so that some of the job tasks write significantly more data than the others. When a Datanode receives data directly from a client, it stores a copy on its local storage to preserve data locality, so the datanodes receiving more data usually have higher storage utilization.

3. Block Allocation in HDFS

HDFS uses a constraint satisfaction algorithm to allocate file blocks.
Once the constraints are satisfied, HDFS allocates a block by selecting a storage device from the candidate set uniformly at random. For large clusters, the blocks are essentially allocated randomly in a uniform distribution, provided that the client applications write data to HDFS uniformly across the datanode machines. Note that uniform random allocation may not result in a uniform data distribution because of the randomness. This is usually not a problem when the cluster has sufficient space, but it becomes serious when the cluster is nearly full.

In the next article, we will explain the usage of the original configurations/CLI options of the Balancer, as well as the new configurations/CLI options added by the recent enhancements.
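The last point can be illustrated with a small simulation (purely illustrative, not HDFS's actual placement code): even when every block is placed uniformly at random, the per-node counts are never exactly equal.

```python
import random

def simulate_allocation(num_datanodes=10, num_blocks=10_000, seed=42):
    """Place blocks on datanodes uniformly at random and return the
    per-node block counts."""
    rng = random.Random(seed)
    counts = [0] * num_datanodes
    for _ in range(num_blocks):
        counts[rng.randrange(num_datanodes)] += 1
    return counts

counts = simulate_allocation()
spread = max(counts) - min(counts)
# The spread is typically a few percent of the per-node mean (1,000 here):
# harmless while the cluster has free space, but significant when nearly full.
```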