Support Questions


What threshold should we set for HDFS rebalancing to make it finish quicker?



I want to rebalance the cluster, but I am not sure how this threshold value is calculated.

What threshold should I set to finish it quickly? I am not looking for a perfect balance right now, but one of our DataNodes is over 80% used and we want to reduce it to at least 60% or 70%.

Can you please tell me what threshold we should set, and how I can monitor in Ambari whether HDFS balancing has finished?



Master Mentor

@Madhura Mhatre

The hdfs balancer utility analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed balanced.

The threshold parameter is a float between 0 and 100 (12.5, for instance). Starting from the average cluster utilization (say, about 50%), the balancer tries to converge every DataNode's usage into the range [average - threshold, average + threshold]. With a 50% average:
- Upper bound (average + threshold): 60% with the default threshold (10)
- Lower bound (average - threshold): 40%
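To make the arithmetic concrete, here is a minimal shell sketch of that band calculation, using the hypothetical 50% average and threshold of 10 from the example above:

```shell
# Hypothetical example values: 50% average cluster utilization, threshold 10
avg=50
threshold=10
lower=$((avg - threshold))
upper=$((avg + threshold))
# Every DataNode's usage should end up inside this band when balancing finishes
echo "Balanced band: ${lower}% to ${upper}%"
```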

Notice that the smaller the threshold, the more evenly balanced your DataNodes will be. With a very small threshold, however, the cluster may never reach the balanced state if clients are concurrently writing and deleting data.

For your case, a threshold of 15 should be okay:

$ hdfs balancer -threshold 15
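Since the goal is to finish quickly, it can also help to raise the balancer's per-DataNode bandwidth limit before starting; `hdfs dfsadmin -setBalancerBandwidth` does this at runtime (the 100 MB/s value below is just an illustrative choice, tune it for your network):

```shell
# Temporarily raise the per-DataNode balancing bandwidth to 100 MB/s
# (104857600 bytes/sec); the default is often much lower, which slows balancing.
$ hdfs dfsadmin -setBalancerBandwidth 104857600
```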

or, to restrict balancing to specific DataNodes:

$ hdfs balancer -threshold 15 -include hostname1,hostname2,hostname3
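As for monitoring: the balancer prints the bytes moved on each iteration and exits once all nodes are within the threshold, and Ambari shows per-DataNode disk usage on its HDFS and hosts dashboards. From the command line, one way to watch usage converge (assuming the usual `hdfs dfsadmin -report` output format) is:

```shell
# Show each DataNode's name and its DFS Used% from the cluster report;
# re-run periodically to watch the values converge toward the average.
$ hdfs dfsadmin -report | grep -E "^Name:|DFS Used%"
```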