Support Questions


How does the HDFS balancer work?

Explorer

Hi,

I have a cluster with 3 DataNodes, each with 6 disks installed: 3 disks of 1 TB and 3 disks of 4 TB. The smaller 1 TB disks keep filling up, and when they do, Impala stops working. Is there a way to rebalance those disks so that each keeps some percentage of free space?

For example, none of the 6 disks should exceed 80% usage before the balancer completes.

1 ACCEPTED SOLUTION

Master Collaborator

HDFS data might not always be distributed uniformly across DataNodes. One common reason is addition of new DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. The balancer does not balance between individual volumes on a single DataNode.
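In practice, the balancer utility described above is run from the command line with a threshold percentage (a sketch; 10 is the usual default, and the command must run on a node with HDFS client access):

```shell
# Run the HDFS balancer so that every DataNode's utilization ends up
# within 10 percentage points of the overall cluster utilization.
hdfs balancer -threshold 10
```

A lower threshold gives a more evenly balanced cluster but makes the balancer run longer and move more data.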

 

To free up space on particular DataNodes, you can use a block-distribution application to pin block replicas to specific DataNodes so that the pinned replicas are not moved during cluster balancing.

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_hdfs-administration/content/overview_hdfs_b...


3 REPLIES

Expert Contributor

Hi @Chetankumar ,

 

You have heterogeneous storage, and HDFS follows rack topology when balancing blocks across DataNodes.

 

Currently the DataNode volume choosing policy is Round Robin; you can change it to Available Space. With that policy, new data is written to the less-used disks on each DataNode, since volumes are chosen based on available space. This can help in your case.

 

You can change this setting in Cloudera Manager:

CM -> HDFS -> Configuration -> DataNode Volume Choosing Policy -> change to Available Space
Save the changes and restart the DataNodes.
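If you manage hdfs-site.xml directly rather than through Cloudera Manager, the equivalent configuration is roughly the following (a sketch; verify the class name and defaults against your Hadoop version):

```xml
<!-- Choose DataNode volumes by available space instead of round robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<!-- Volumes whose free space differs by less than this (bytes) are treated as balanced -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<!-- Fraction of new block allocations sent to the volumes with more free space -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```

Note this only affects where *new* blocks are written; it does not move existing blocks off the full disks.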

If that helps, please feel free to mark the post as the accepted solution.

 

regards,

Vipin

Cloudera Employee

Hello,

If you have unbalanced disks within a DataNode, please use the intra-node disk balancer. Volume selection is Round Robin by default, and because some of your disks are smaller than the others, they fill up first and cause these issues.

 

Please refer below doc:

https://blog.cloudera.com/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
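The intra-DataNode disk balancer covered in that post is driven from the command line, roughly as follows (a sketch; the hostname and plan path are placeholders, and `dfs.disk.balancer.enabled` must be set to `true` in hdfs-site.xml first):

```shell
# 1. Compute a move plan for one DataNode (hostname is a placeholder)
hdfs diskbalancer -plan dn1.example.com

# 2. Execute the plan file that step 1 wrote to HDFS
#    (the -plan command prints the actual path)
hdfs diskbalancer -execute /system/diskbalancer/<date>/dn1.example.com.plan.json

# 3. Check progress of the running plan
hdfs diskbalancer -query dn1.example.com
```

Unlike the cluster-level balancer, this moves blocks between the volumes of a single DataNode, which is what you need for mixed 1 TB / 4 TB disks.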

 

You can also use the Available Space volume choosing policy for new writes.

Note that the cluster-level HDFS balancer balances DataNodes to within a specified threshold percentage: it considers the overall usage of each DataNode, not the individual disks within it.
