
Cleaning /dfs/dn sub-directories to free disk space

Expert Contributor

Hello everyone,

 

I am running into an issue: /dfs/dn is consuming all the disk space in my distributed cluster. The cluster has four nodes, each with 100 GB of HDD space. Cloudera Manager reports about 200 GB consumed, but when I check HDFS only about 50 GB is used. Can anyone help me clean up these directories? Or, if cleaning is not an option, how can I compress the data without having to scale up?

 

Thanks

1 ACCEPTED SOLUTION

Mentor
This may be a very basic question, but I ask because it is unclear from the data you've posted: have you accounted for replication? 50 GiB of HDFS file lengths summed up (the hdfs dfs -du values) at 3x replication would be ~150 GiB of actual space used on physical storage.
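For example, you could compare the logical size HDFS reports with the raw space consumed on the DataNodes (the path below is just a placeholder for your own data):

  # Logical size and, on recent Hadoop releases, space consumed including replicas
  hdfs dfs -du -s -h /user/myapp

  # Cluster-wide view of configured, used, and remaining DataNode capacity
  hdfs dfsadmin -report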

The /dfs/dn directory is where the file block replicas are stored. Nothing unnecessary is retained in HDFS; however, a commonly overlooked item is older snapshots holding on to data blocks that are no longer needed. Deleting such snapshots frees the space occupied by files that were deleted after the snapshot was taken.
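A quick sketch of how to check for this (the directory and snapshot names are just examples):

  # List directories where snapshots are enabled
  hdfs lsSnapshottableDir

  # List the snapshots under one of them
  hdfs dfs -ls /data/warehouse/.snapshot

  # Delete a snapshot that is no longer needed; blocks only it referenced are reclaimed
  hdfs dfs -deleteSnapshot /data/warehouse snap-2019-01-01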

If you're unable to grow your cluster but need to store more data, you can trade away some data availability by lowering your default replication factor to 2x or 1x (via the dfs.replication config for new writes, and hdfs dfs -setrep n for existing data).
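A minimal sketch of both steps, assuming an example path and a target replication factor of 2:

  # Lower the replication factor of existing data; -w waits until re-replication completes
  hdfs dfs -setrep -w 2 /data/warehouse

  # For new writes, set dfs.replication to 2 in the HDFS service configuration
  # (Cloudera Manager) or hdfs-site.xml, then redeploy client configuration.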
