Created on 07-24-2018 11:13 AM - edited 09-16-2022 06:30 AM
Does Hadoop handle storage exhaustion on one of the data nodes in the cluster? How?
Created 07-24-2018 11:40 AM
Storage exhaustion is an uneven distribution of data across the data nodes in the cluster.
It is usually caused by the following:
Addition and removal of the data nodes in the cluster.
Multiple write and delete operations.
Hadoop provides a tool called the disk balancer, which re-balances data by moving blocks from over-utilized data nodes to under-utilized ones until the configured threshold is met. Before moving any blocks, the disk balancer builds a plan of how much data should be transferred. It uses a round-robin or available-space policy for choosing the destination disk. The disk balancer is not enabled by default. To enable it:
1) Open hdfs-site.xml, which is located in (Hadoop-2.5.0-cdh5.3.2/etc/Hadoop)
2) Set the property dfs.disk.balancer.enabled to true
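For reference, a minimal hdfs-site.xml entry would look something like the sketch below. The exact file location varies by distribution, and this property only exists in releases that ship the intra-datanode disk balancer, so treat it as an illustration rather than an exact recipe:

<property>
  <!-- enable the disk balancer feature on the data nodes -->
  <name>dfs.disk.balancer.enabled</name>
  <value>true</value>
</property>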
We can use the command start-balancer.sh to invoke the balancer, and we can also run it with hdfs balancer.
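A quick sketch of both invocations (the -threshold value is optional and is shown here only as an illustration; it is the allowed deviation, as a percentage of disk capacity, between each data node's usage and the cluster average):

# start the balancer as a background service with the default threshold
start-balancer.sh

# or run it in the foreground with an explicit threshold of 10 percent
hdfs balancer -threshold 10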
It's suggested to run the balancer after adding new nodes to the cluster.
Created 07-24-2018 06:26 PM
Contrary to the answer by @Harshali Patel, storage exhaustion is not defined as an uneven distribution of data; it is rather a cause of it.
A datanode has a property that you can set which defines a threshold of disk space that must be reserved for the OS on that server. Once that limit is exceeded, the datanode process will stop and log an error telling you to delete some files from it. HDFS will continue to function with the other datanodes.
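The property being referred to is presumably dfs.datanode.du.reserved, which reserves a number of bytes per volume for non-HDFS use; a minimal hdfs-site.xml sketch, with a 10 GB value chosen purely as an example:

<property>
  <!-- example only: reserve roughly 10 GB per volume for the OS and other non-HDFS use -->
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>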
The balancer can be run to keep storage space healthy and even.