I had 34 data nodes that are identical with regard to CPU, memory, and storage.
I recently added 3 data nodes with the same CPU and memory but less storage (9TB as opposed to 11TB on the older nodes).
I ran hdfs balancer, which chugged for a while and moved data to the 3 new nodes.
The problem is that 3 file systems on each of the 3 new nodes do not seem to be getting data.
I have run the balancer from the CLI and the output states the cluster is balanced. The older nodes (with more storage) are at about 50% disk utilization; the 3 new nodes are at about 30% disk utilization.
That is because the hdfs balancer works slowly through the data stored on each disk. The balancer calculates how much data to move from one datanode to another. You can set parameters such as the threshold (HDFS > Service Actions : Rebalance HDFS) and the bandwidth (HDFS > Advanced hdfs-site : dfs.datanode.balance.bandwidthPerSec). The balancing process does not run continuously until every datanode's disks are well balanced. For example, a single balancing run performs 5 iterations, moving data blocks to other datanodes' disks, and then sleeps for about two and a half hours. How long it takes for all datanodes' disks to become reasonably balanced depends, of course, on how much disk space each datanode has consumed and on the network bandwidth; it can take more than a week.
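To make the threshold and bandwidth settings more concrete, here is a minimal Python sketch, not the balancer's actual code. It assumes the simplified rule that a datanode counts as balanced when its utilization is within the threshold (in percentage points) of the cluster-wide average, and it uses the approximate figures from the question. The function names and the 10 MB/s bandwidth value are hypothetical, chosen only for illustration; the time estimate ignores the sleep between runs and any other limits.

```python
def cluster_utilization(nodes):
    """Cluster-wide utilization in percent; nodes is a list of (capacity_tb, used_tb)."""
    total_capacity = sum(capacity for capacity, _ in nodes)
    total_used = sum(used for _, used in nodes)
    return 100.0 * total_used / total_capacity


def nodes_outside_threshold(nodes, threshold_pct):
    """Indexes of nodes whose utilization differs from the cluster average by more than the threshold."""
    average = cluster_utilization(nodes)
    return [
        index
        for index, (capacity, used) in enumerate(nodes)
        if abs(100.0 * used / capacity - average) > threshold_pct
    ]


def hours_to_move(bytes_to_move, bandwidth_bytes_per_sec):
    """Lower bound on transfer time for one datanode at a fixed bandwidth cap."""
    return bytes_to_move / bandwidth_bytes_per_sec / 3600


# Approximate figures from the question: 34 nodes of 11 TB at about 50%,
# 3 nodes of 9 TB at about 30%.
nodes = [(11, 5.5)] * 34 + [(9, 2.7)] * 3

print("cluster average %:", round(cluster_utilization(nodes), 2))
print("outside a 10-point threshold:", nodes_outside_threshold(nodes, 10))

# Data one new 9 TB node would need to receive to reach the cluster average,
# moved at an assumed (hypothetical) cap of 10 MB/s.
average = cluster_utilization(nodes)
missing_bytes = (average / 100.0 * 9 - 2.7) * 1e12
print("hours at 10 MB/s:", round(hours_to_move(missing_bytes, 10_000_000), 1))
```

With these approximate figures the three new nodes sit well below the cluster average, so whether the balancer reports the cluster as balanced depends on the threshold it was run with. The time estimate scales inversely with the bandwidth setting, which is why raising dfs.datanode.balance.bandwidthPerSec can shorten a long rebalance.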