Community Articles

sze · ‎07-08-2016

The Balancer runs in iterations for balancing a cluster. An iteration consists of four steps. We discuss each step in detail below.

Step 1, Storage Group Classification

Balancer first invokes the getLiveDatanodeStorageReport() rpc to the Namenode in order to get the storage report for all the storage devices in all Datanodes. The storage report contains storage utilization information such as capacity, DFS used space, remaining space, etc. for each storage device in each Datanode.

A Datanode may contain multiple storage devices and the storage devices may have different storage types. A storage group G(i,T) is defined to be the group of all the storage devices with the same storage type T in Datanode i. For example, G(i,DISK) is the storage group of all the disk storage devices in Datanode i.

For each storage type T in each Datanode i, Balancer computes Storage Group Utilization (%)

U(i,T) = 100% * (storage group used space)/(storage group capacity),

and Average Utilization (%)

U(avg,T) = 100% * (sum of all used spaces)/(sum of all capacities).

Let Δ be the threshold parameter (default is 10%) and G(i,T) be the storage group with storage type T in Datanode i. Then, the utilization classes are defined as following.

                        Over-Utilized:  { G(i,T) :  U(avg,T) + Δ < U(i,T) },
Average + Threshold ----------------------------------------------------------------------------
                        Above-Average:  { G(i,T) :  U(avg,T) < U(i,T) <= U(avg,T) + Δ },
Average ----------------------------------------------------------------------------------------
                        Below-Average:  { G(i,T) :  U(avg,T) - Δ <= U(i,T) <= U(avg,T) },
Average - Threshold ----------------------------------------------------------------------------
                        Under-Utilized: { G(i,T) :  U(i,T) < U(avg,T) - Δ }.

Roughly speaking, a storage group is over-utilized (under-utilized) if its utilization is larger (smaller) than the average plus (minus) threshold. A storage group is above-average (below-average) if its utilization is larger (smaller) than average but within the threshold.

If there are no over-utilized storage groups and no under-utilized storage groups, the cluster is said to be balanced. The Balancer terminates with a SUCCESS state. Otherwise, it continues with the following steps.

Step 2, Storage Group Pairing

The Balancer selects over-utilized or above-average storage groups as sources, and under-utilized or below-average storage groups as targets. It pairs a source storage group with a target storage group (source -> target) in the following priority order.

Same-Rack (the source and the target must reside in the same rack)
1. Over-Utilized -> Under-Utilized
2. Over-Utilized -> Below-Average
3. Above-Average -> Under-Utilized
Any
1. Over-Utilized -> Under-Utilized
2. Over-Utilized -> Below-Average
3. Above-Average -> Under-Utilized

Step 3, Block Move Scheduling

For each source-target pair, the Balancer chooses block replicas from the source storage groups and then schedules block moves. A block replica in a source storage group is a good candidate if it satisfies all the conditions below.

Its storage type is the same as the target storage type.
It is not already scheduled.
The target does not already have the same block replica.
The number of racks of the block is not reduced after the move.

Note that, logically, the Balancer schedules a block replica to be “moved” from a source storage group to a target storage group. In practice, since a block usually has multiple replicas, the block move can be done by first copying the replica from a proxy, which can be any storage group containing one of the replicas of the block, to the target storage group, and then deleting the replica from the source storage group.

After a candidate block in the source datanode is chosen, the Balancer selects a storage group containing the same replica as the proxy. The Balancer selects the closest storage group as the proxy in order to minimize the network traffic.

When it is impossible to schedule any move, the Balancer terminates with a NO_MOVE_BLOCK state.

Step 4, Block Move Execution

The Balancer dispatches a scheduled block move by invoking the DataTransferProtocol.replaceBlock(..) method to the target datanode. It specifies the proxy and the source as delete-hint in the method call. Then, the target datanode copies the replica directly from the proxy to its local storage. Once the copying process has been completed, the target datanode reports the new replica to the Namenode with the delete-hint. Namenode uses the delete-hint to delete the extra replica, i.e. delete the replica stored in the source.

After all block moves are dispatched, the Balancer waits until all the moves are completed. Then, the Balancer continues to run a new iteration and repeats all these steps. In case that all the scheduled moves fail for 5 consecutive iterations, the Balancer terminates with a NO_MOVE_PROGRESS state.

We end this article by listing the exit states in the following table.

Exit States

State	Code	Description
SUCCESS	0	The cluster is balanced (i.e. no over/under-utilized storage groups) regarding to the given threshold.
ALREADY_RUNNING	-1	Another Balancer is running.
NO_MOVE_BLOCK	-2	The Balancer is not able to schedule any move.
NO_MOVE_PROGRESS	-3	All the scheduled moves fail for 5 consecutive iterations
IO_EXCEPTION	-4	An IOException being thrown
ILLEGAL_ARGUMENTS	-5	Found an illegal argument in the command or in the configuration.
INTERRUPTED	-6	The Balancer process is interrupted.
UNFINALIZED_UPGRADE	-7	The cluster is being upgraded.

yogesh_darji99 · ‎04-25-2017

Thanks for nice explanation, I am facing an issue here, first my replication factor is 1, and second my DN1 has 1.13GB and my DN2 has 1.89GB, and both have capacity of 17.72GB, can you please tell me why it is not balancing automatically wrt 10% default threhold? Thanks for help

VijayM · ‎10-10-2017

Hello Team,

I already have balanced cluster where all datanodes have same space utilization. 6*4TB storage across all datanodes and all are 80% utilized.

Since we do not have extra node to add for now we are adding extra strorage of 6*4TB across all datanodes.

My question is:

1. Do i need to re-balance cluster?

2. If yes Do i still need to run balancer with default threshold of 10% or any else?

- Vijay Mishra

Cloudera Community