Created 09-17-2021 02:29 AM
Hi,
I have a cluster with 3 data nodes and 6 disks installed on each server: 3 disks of 1 TB and 3 disks of 4 TB. What is happening is that the smaller 1 TB disks keep getting full and Impala stops working. Is there a way I can rebalance those disks while keeping some percentage of free space?
For example, none of the 6 disks should exceed 80% usage before the balancer completes.
Created 09-23-2021 07:51 AM
HDFS data might not always be distributed uniformly across DataNodes. One common reason is addition of new DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. The balancer does not balance between individual volumes on a single DataNode.
To free up space on particular DataNodes, an application that manages block distribution can pin its block replicas to specific DataNodes so that the pinned replicas are not moved during cluster balancing.
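The "balanced" condition described above can be illustrated with a small sketch (this is not Hadoop code, just the utilization arithmetic; the node sizes are made up for the example):

```python
# Sketch of the balancer's "balanced" check: each DataNode's utilization
# must be within `threshold` percentage points of the cluster utilization.

def is_balanced(nodes, threshold=10.0):
    """nodes: list of (used, capacity) pairs, one per DataNode."""
    total_used = sum(used for used, _ in nodes)
    total_cap = sum(cap for _, cap in nodes)
    cluster_util = 100.0 * total_used / total_cap
    return all(
        abs(100.0 * used / cap - cluster_util) <= threshold
        for used, cap in nodes
    )

# A node at 90% next to nodes at 30% is far outside the default threshold.
nodes = [(900, 1000), (1200, 4000), (1200, 4000)]
print(is_balanced(nodes))        # False: 90% is 53 points above the ~37% average
print(is_balanced(nodes, 60.0))  # True with a very loose threshold
```

This is also why the balancer alone does not solve the original problem: it compares whole DataNodes, not the individual 1 TB and 4 TB volumes inside one node.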
Created 09-17-2021 03:09 AM
Hi @Chetankumar ,
You have heterogeneous storage, and HDFS follows the rack topology to balance blocks across the DataNodes.
Currently the DataNode volume choosing policy is Round Robin; we can change it to Available Space. With that policy, new data is written to the less-used disks, since the DataNode chooses a volume based on available space. This can help in your case.
You can apply the setting below in HDFS:
CM->HDFS-> config -> DataNode Volume Choosing Policy -> change to Available Space
Save changes and restart datanodes.
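The effect of the Available Space policy can be sketched as follows. This is a deliberately simplified picture, not the actual Hadoop implementation (the real AvailableSpaceVolumeChoosingPolicy only prefers emptier volumes when the free-space spread exceeds dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold, and even then chooses them probabilistically); the volume paths and sizes here are made up:

```python
# Simplified sketch of the idea behind the Available Space policy:
# write new blocks to the volume with the most free space, instead of
# cycling through volumes round-robin regardless of how full they are.

def choose_volume(volumes, block_size):
    """volumes: dict of mount path -> free bytes.
    Return the volume with the most free space that can hold the block."""
    candidates = {path: free for path, free in volumes.items()
                  if free >= block_size}
    if not candidates:
        raise IOError("no volume has enough free space for the block")
    return max(candidates, key=candidates.get)

volumes = {"/data1": 50 * 2**30,    # 1 TB disk, nearly full
           "/data4": 3000 * 2**30}  # 4 TB disk, mostly empty
print(choose_volume(volumes, 128 * 2**20))  # picks /data4
```

Round Robin would keep writing to /data1 every other block until it filled up, which is exactly the failure mode described in the question.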
If that helps, please feel free to mark the post as an accepted solution.
regards,
Vipin
Created 09-17-2021 03:10 AM
Hello
If you have unbalanced disks within a DataNode, please use the intra-DataNode disk balancer. By default volumes are filled in round-robin fashion, and since a few of your disks are smaller than the others, you are running into this issue.
Please refer below doc:
https://blog.cloudera.com/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
We can also switch the volume choosing policy to Available Space, as mentioned above.
Note that the HDFS balancer balances DataNodes to within a specified percentage, so it considers the overall usage of each DataNode rather than the individual disks on the DataNode.
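This distinction matters for the setup in the question: a node with three nearly-full 1 TB disks and three mostly-empty 4 TB disks can look fine at the node level, which is why the intra-DataNode disk balancer is needed. A small sketch with hypothetical disk sizes:

```python
# Hypothetical DataNode: three 1 TB disks near capacity, three 4 TB
# disks mostly empty (sizes in GB). The cluster balancer sees only the
# node-level utilization; the intra-DataNode disk balancer (the hdfs
# diskbalancer command) looks at each disk individually.

disks = [(950, 1000)] * 3 + [(1000, 4000)] * 3  # (used, capacity) per disk

node_util = 100.0 * sum(u for u, _ in disks) / sum(c for _, c in disks)
disk_utils = [100.0 * u / c for u, c in disks]

print(round(node_util, 1))  # node looks only ~39% full overall
print(max(disk_utils))      # yet the 1 TB disks are at 95%
```

So the cluster balancer would not move anything off this node, even though its small disks are about to fill up; only the intra-DataNode disk balancer redistributes blocks between its volumes.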