I am currently experiencing an issue with HDFS storage. We have intentionally deleted the majority of the data on our cluster. According to hdfs dfs -du, the total usage on HDFS is approximately 1TB. However, Ambari reports that the DFS used is 238.9TB. I can understand a small discrepancy here, for blocks that are still pending deletion and such, and du reports logical size before replication, but even at 3x replication that would only be ~3TB. A difference this large is worrying.
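For context, here is the back-of-the-envelope check behind that claim (the 3x replication factor is an assumption; our dfs.replication setting may differ):

```shell
# Even at 3x replication (assumed; check dfs.replication), ~1 TB of logical
# data should account for ~3 TB of raw "DFS Used", not 238.9 TB.
awk 'BEGIN {
  du = 1.0          # TB, logical size from hdfs dfs -du -s /
  repl = 3          # assumed replication factor
  reported = 238.9  # TB, "DFS Used" as shown by Ambari
  printf "expected=%.1fTB unexplained=%.1fTB\n", du * repl, reported - du * repl
}'
```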
On top of this, a huge number of the underlying disks are 100% full, and no amount of HDFS balancing changes this. HDFS has been incredibly unstable over the past few weeks, and it's possible this is the underlying cause.
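In case it helps, these are the checks I have been running to work out where the space has gone (the datanode data directory path is an example; ours is whatever dfs.datanode.data.dir points at):

```shell
# Raw usage as a datanode sees it (run on the datanode; path is an example):
du -sh /hadoop/hdfs/data

# Cluster-wide DFS Used vs Non DFS Used, broken down per datanode:
hdfs dfsadmin -report

# Space retained by trash (deleted files linger here until expunged):
hdfs dfs -du -s -h /user/*/.Trash

# Snapshots can pin blocks even after the live files are deleted:
hdfs lsSnapshottableDir
```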
Is there any way I can safely clear this space? I don't mind losing data (we are repurposing this cluster) as long as HDFS remains stable. The full disks are my bigger concern, but fixing that should also clear much of the falsely reported usage on HDFS. Any advice would be appreciated.
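For reference, these are the cleanup steps I am considering, in case any of them are unsafe on an unstable cluster (the snapshot path and name below are placeholders):

```shell
# Empty trash immediately instead of waiting for fs.trash.interval:
hdfs dfs -expunge

# Remove snapshots that may be pinning otherwise-deleted blocks
# (directory and snapshot name are placeholders):
hdfs dfs -deleteSnapshot /path/to/dir s20240101

# Once space is actually released, rebalance across the full disks:
hdfs balancer -threshold 10
```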