I have a cluster of 8 Data Nodes with 42TB total capacity. Replication is set to 2. After loading ~10TB of data, I found that only ~7TB was left. I then deleted 1TB of data with the "-skipTrash" option, but did not see any extra disk space freed. Below is my disk usage:
hdfs@msl-dpe-perf87:/home/harry.li/tpcds_5.db$ hdfs dfs -df -h
Filesystem                          Size    Used    Available  Use%
hdfs://msl-dpe-perf88.msl.lab:8020  42.4 T  32.7 T  7.3 T      77%

hdfs@msl-dpe-perf87:/home/harry.li/tpcds_5.db$ hdfs dfs -du -h /
9.0 T    /TPCDS
979.2 M  /app-logs
477.8 G  /apps
0        /ats
918.2 M  /hdp
0        /mapred
8.4 M    /mr-history
0        /spark-history
5.7 M    /spark2-history
2.4 K    /tmp
105.7 G  /user
1. The math here doesn't seem to add up. With ~10TB of data at replication factor 2, I should still have at least 20TB left. Why do I have only 7TB left?
2. Why did deleting the data not free up any disk space?
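To make question 1 concrete, the numbers from the df/du output above can be checked with a quick sketch (plain Python; the tiny directories are ignored, and where the unexplained space actually went is exactly what is being asked):

```python
# Figures taken from the `hdfs dfs -df -h` / `hdfs dfs -du -h` output above.
cluster_size_tb = 42.4
df_used_tb = 32.7          # "Used" as reported by df (raw, i.e. after replication)
replication = 2

# Sum of the du listing (logical size, before replication); dominated by
# /TPCDS (9.0 T), /apps (477.8 G) and /user (105.7 G).
du_logical_tb = 9.0 + 477.8 / 1024 + 105.7 / 1024   # ~9.6 TB

expected_raw_tb = du_logical_tb * replication        # ~19.1 TB raw
unexplained_tb = df_used_tb - expected_raw_tb        # ~13.6 TB unaccounted for

print(f"logical data per du:           {du_logical_tb:.1f} TB")
print(f"expected raw use at repl. {replication}:   {expected_raw_tb:.1f} TB")
print(f"df reports used:              {df_used_tb} TB -> ~{unexplained_tb:.1f} TB unexplained")
```

So roughly 13-14 TB of reported usage is not explained by the visible files times the replication factor.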
@Harry Li : Space used depends on multiple factors, such as file size, block size, and replication factor. What is the average file size of the data you loaded? It may be that your configured block size is too large compared to the files you are adding. Can you do a hadoop listing on the TPCDS directory?
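One quick way to get that average file size is from the `hdfs dfs -count /TPCDS` output (columns: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME). The numbers below are made-up placeholders for illustration, not from this cluster:

```shell
# Hypothetical output of `hdfs dfs -count /TPCDS`; replace with your real output.
count_output="     120       50000    9895604649984 /TPCDS"

# Average file size in bytes = CONTENT_SIZE / FILE_COUNT (truncated to an integer).
avg_bytes=$(echo "$count_output" | awk '{printf "%d", $3 / $2}')
echo "average file size: $avg_bytes bytes"
```

Comparing that average against your configured `dfs.blocksize` will show whether the files are small relative to the block size.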