@Madhura Mhatre
The hdfs balancer utility analyzes block placement and balances data across the DataNodes. It keeps moving blocks until the cluster is deemed to be balanced.
The threshold parameter is a floating-point percentage between 0 and 100 (12.5 for instance). Starting from the average cluster utilization (about 50% in the graph below), the balancer tries to bring every DataNode's utilization into the range [average - threshold, average + threshold]. In the current example:
- Upper bound (average + threshold): 60% when run with the default threshold (10%)
- Lower bound (average - threshold): 40%
As you can see, the smaller the threshold, the more evenly balanced your DataNodes will be. With a very small threshold, however, the cluster may never reach the balanced state if other clients are concurrently writing and deleting data.
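To make the arithmetic concrete, here is a small sketch that prints the target range for a given average utilization and threshold. The function name balancer_range is made up for illustration and is not part of HDFS; it only reproduces the [average - threshold, average + threshold] calculation described above.

```shell
# Hypothetical helper (not an HDFS command): print the lower and upper
# utilization bounds, in percent, that the balancer aims for.
#   $1 = average cluster utilization (%)
#   $2 = threshold (%)
balancer_range() {
  awk -v avg="$1" -v thr="$2" 'BEGIN {
    printf "%.1f %.1f\n", avg - thr, avg + thr
  }'
}

balancer_range 50 10   # default threshold: prints "40.0 60.0"
balancer_range 50 15   # wider range: prints "35.0 65.0"
```

With an average of 50%, a threshold of 15 gives a wider target range than the default of 10, so fewer blocks should need to move.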
A threshold of 15 should be okay:
$ hdfs balancer -threshold 15
Or, to restrict balancing to a given list of DataNodes:
$ hdfs balancer -threshold 15 -include hostname1,hostname2,hostname3
HTH