PROBLEM: The Balancer exits within a few minutes without moving any blocks.
SYMPTOMS: The Balancer exits with the following messages:
16/11/22 07:08:29 DEBUG ipc.Client: IPC Client (280134559) connection to ma2-gbit-lnn51.corp.apple.com/10.184.67.21:8020 from hdfs-BD_TEST2@HADOOP.GCSKDC.CORP.APPLE.COM got value #1193
16/11/22 07:08:29 DEBUG ipc.ProtobufRpcEngine: Call: getBlocks took 2486ms
No block has been moved for 5 iterations. Exiting...
Nov 22, 2016 7:08:29 AM    4    0 B    35.86 TB    200 GB
ROOT CAUSE: The rack distribution (nodes per rack) was as follows:
/default-rack : 91
/Example1 : 18
/Example2 : 2
The nodes at 100% utilization, which we were trying to balance in order to free up space, were the 20 nodes registered in racks /Example1 and /Example2. Given the following rack-awareness rules that the Balancer applies when choosing blocks to move (rule 3 is the one relevant to this issue), not even a single block could be moved without compromising fault tolerance.
/**
 * Decide if the block is a good candidate to be moved from source to target.
 * A block is a good candidate if
 * 1. the block is not in the process of being moved/has not been moved;
 * 2. the block does not have a replica on the target;
 * 3. doing the move does not reduce the number of racks that the block has
 */
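Rule 3 can be illustrated with a minimal sketch. This is a simplified model, not the Balancer's actual implementation: the class name RackRuleSketch, the method reducesRackCount, and the replica layout (one replica on a full node in /Example1, two replicas in /default-rack) are all assumptions made for illustration. It only compares the number of distinct racks a block occupies before and after a proposed move.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: models rule 3 only.
class RackRuleSketch {

    /**
     * Returns true if moving the replica at sourceIndex to a node on
     * targetRack would leave the block on fewer distinct racks than before.
     */
    static boolean reducesRackCount(List<String> replicaRacks, int sourceIndex, String targetRack) {
        // Distinct racks holding a replica before the move.
        Set<String> before = new HashSet<>(replicaRacks);

        // Simulate the move: the source replica now lives on targetRack.
        List<String> moved = new ArrayList<>(replicaRacks);
        moved.set(sourceIndex, targetRack);
        Set<String> after = new HashSet<>(moved);

        return after.size() < before.size();
    }

    public static void main(String[] args) {
        // Assumed layout: replica 0 sits on a full node in /Example1,
        // the other two replicas sit in /default-rack.
        List<String> replicaRacks = Arrays.asList("/Example1", "/default-rack", "/default-rack");

        // Moving replica 0 into /default-rack collapses the block onto one rack.
        System.out.println(reducesRackCount(replicaRacks, 0, "/default-rack")); // true

        // Moving replica 0 to /Example2 keeps two racks, but those nodes are full too.
        System.out.println(reducesRackCount(replicaRacks, 0, "/Example2")); // false
    }
}
```

Under these assumptions, the only nodes with free space are in /default-rack, and every move from /Example1 or /Example2 into /default-rack is rejected because it would reduce the block's rack count. The moves that would pass the check target nodes that are already full.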
SOLUTION: Distribute nodes evenly across all racks. If this is not possible, either add storage to the affected nodes or add new DataNodes to the affected racks.