How to clean up data from a single HDFS disk?

Explorer

Hi,

We currently have 8 DataNodes, each with two HDFS disks mounted.

One of the disks on one DataNode is full. On this node the HDFS (NodeManager) service would not come up; it failed with the error below.

After reading some articles, I found that we can set the "DataNode failed disk tolerance" value to 1 and ignore this volume since it is 100% full. But I would like to understand how I can clean up the data on this disk.

I tried rebalancing HDFS, but which threshold value should I use?

StorageLocation [DISK]file:/hdfs/data1/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Error checking directory /hdfs/data1/hadoop/hdfs/data
/dev/xvdl        50G   49G     0 100% /hdfs/data1
/dev/xvdk        50G   27G   21G  56% /hdfs/data0
3 REPLIES

Master Mentor

@Madhura Mhatre

The HDFS balancer does not run in the background; it has to be run manually. The threshold parameter is a float between 0 and 100. Starting from the average cluster utilization, the balancer tries to bring every data node's usage into the range [average - threshold, average + threshold].

The default threshold is 10%

For example, if the cluster's current utilization is 50% and the threshold is the default 10%, data nodes above the higher bound will start to move data to nodes below the lower bound.

- Higher (average + threshold): 60% 
- Lower  (average - threshold): 40% 
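The band arithmetic above can be sketched in a few lines of shell. This is only an illustration: `balancer_band` is a hypothetical helper, not part of Hadoop, and it uses integer percentages even though the real threshold is a float.

```shell
#!/bin/sh
# Hypothetical helper: print the utilization band the balancer aims for.
# $1 = average cluster utilization (%), $2 = threshold (%), both integers here.
balancer_band() {
    avg=$1
    threshold=$2
    lower=$((avg - threshold))
    upper=$((avg + threshold))
    # Utilization cannot go below 0% or above 100%, so clamp the band.
    if [ "$lower" -lt 0 ]; then lower=0; fi
    if [ "$upper" -gt 100 ]; then upper=100; fi
    echo "${lower}-${upper}"
}

balancer_band 50 10   # prints 40-60
```

With the same 50% average, a threshold of 25 widens the band to 25-75, which is why a larger threshold leaves more nodes already "balanced" and moves less data.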

TIP:

If you haven't balanced your cluster for a long time, start with a higher threshold such as 25, then converge to a smaller target threshold such as 10.

The smaller your threshold, the more balanced your data nodes will be. With a very small threshold, e.g. 2, the cluster may not be able to reach a balanced state if other clients are concurrently writing and deleting data in the cluster. So in your scenario, I would advise a threshold of 25:

$ hdfs balancer -threshold 25 
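If you want to step the threshold down in stages, as the tip suggests, a small wrapper like the one below may help. It is a sketch, not a tested procedure: `run_balancer_steps` and the `DRY_RUN` variable are hypothetical names, and only `hdfs balancer -threshold` itself comes from this thread. By default it just prints the commands so you can review them first.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Set DRY_RUN=0 to actually invoke the balancer.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: run the balancer once per threshold, in the order given.
run_balancer_steps() {
    for t in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "hdfs balancer -threshold $t"
        else
            # Stop at the first run that exits non-zero.
            hdfs balancer -threshold "$t" || return 1
        fi
    done
}

run_balancer_steps 25 20 15 10
```

Keep in mind that the balancer evens out usage across data nodes; it does not move blocks between the disks of a single data node.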

First, the balancer will pick data nodes with current usage above the higher threshold and try to find blocks on them that can be copied to nodes with current usage below the lower threshold.

Second, the balancer will select over-utilized nodes to move blocks to nodes with utilization below average. Third, the balancer will pick nodes with utilization above average to move data to under-utilized nodes.

In addition to that selection process, the balancer can also pick a proxy node if the source and the destination are not located in the same rack (i.e. a data node that stores a replica of the block and is located in the same rack as the destination). Yes, the balancer is rack-aware and will generate very little rack-to-rack noise.

HTH

Explorer

@Geoffrey Shelton Okot Even after rebalancing HDFS with a 25% threshold, the disk is still at 100%. Is HDFS unable to read from the disk because it is full? Also, I had to set the DataNode failed disk tolerance to 1 because the HDFS service was not coming up on that node. Can we delete the data manually from that particular disk? Is there any way?

Master Mentor

@Madhura Mhatre

Can you share the output of the rebalancer? Have you tried reducing the threshold gradually, from 25% to 20%, 15%, and 10%, to see if you are gaining some space?

Depending on the purpose of the cluster, you shouldn't attempt this on production. You can remove the directory from the Ambari property dfs.datanode.data.dir and do a rolling restart of the data nodes. Make sure you don't have any missing or under-replicated blocks before restarting the data node.
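One way to check for missing or under-replicated blocks before the restart is to look at the summary printed by `hdfs fsck /`. The sketch below assumes the summary contains lines labelled "Missing blocks:" and "Under-replicated blocks:" (or "Under replicated blocks:"); the exact wording can differ between Hadoop versions, so treat the patterns as an assumption and adjust them to your output. `blocks_healthy` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read an fsck summary on stdin and succeed only if it
# reports zero missing and zero under-replicated blocks.
# Assumes "<label>: <count> ..." lines; verify against your Hadoop version.
blocks_healthy() {
    awk -F: '
        /Missing blocks:/             { missing += $2 + 0; seen_m = 1 }
        /Under[- ]replicated blocks:/ { under += $2 + 0; seen_u = 1 }
        END {
            # Fail if a counter is non-zero or an expected line was not found.
            if (seen_m && seen_u && missing == 0 && under == 0) exit 0
            exit 1
        }
    '
}

# Only runs where the hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
    hdfs fsck / | blocks_healthy && echo "No missing or under-replicated blocks reported"
fi
```

The helper deliberately fails when it cannot find both lines, so an unexpected output format is never mistaken for a healthy cluster.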

Then unmount /hdfs/data1, format it, and remount it.