I'm running the sandbox on a VirtualBox virtual machine. It is a single-node cluster with a replication factor of 1.
After deleting files in the Hadoop file system and removing them from the trash, I don't get disk space back even after waiting for a while. I tried to use:
[hdfs@sandbox ~]$ hadoop fs -expunge
[hdfs@sandbox ~]$
When I use hdfs dfsadmin -report, I get:
[hdfs@sandbox ~]$ hdfs dfsadmin -report
Configured Capacity: 45103345664 (42.01 GB)
Present Capacity: 25068261376 (23.35 GB)
DFS Remaining: 2002014208 (1.86 GB)
DFS Used: 23066247168 (21.48 GB)
DFS Used%: 92.01%
Under replicated blocks: 70
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 172.17.0.2:50010 (sandbox.hortonworks.com)
Hostname: sandbox.hortonworks.com
Decommission Status : Normal
Configured Capacity: 45103345664 (42.01 GB)
DFS Used: 23066247168 (21.48 GB)
Non DFS Used: 20035084288 (18.66 GB)
DFS Remaining: 2002014208 (1.86 GB)
DFS Used%: 51.14%
DFS Remaining%: 4.44%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Thu Apr 06 13:36:57 UTC 2017
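The figures in this report can be cross-checked with simple arithmetic. The sketch below assumes that, for this Hadoop version, "Non DFS Used" is derived as Configured Capacity minus DFS Used minus DFS Remaining, and that the cluster-level "DFS Used%" is relative to Present Capacity while the datanode-level one is relative to Configured Capacity. Both assumptions reproduce the numbers above, but other releases may compute these fields differently.

```python
# Values copied from the `hdfs dfsadmin -report` output above (bytes).
configured_capacity = 45103345664
dfs_used = 23066247168
dfs_remaining = 2002014208

# Assumption: non-DFS usage is whatever is left after DFS used and DFS remaining.
non_dfs_used = configured_capacity - dfs_used - dfs_remaining
# Assumption: present capacity is the configured capacity minus non-DFS usage.
present_capacity = configured_capacity - non_dfs_used

cluster_used_pct = 100 * dfs_used / present_capacity          # cluster summary
datanode_used_pct = 100 * dfs_used / configured_capacity      # per-datanode
datanode_remaining_pct = 100 * dfs_remaining / configured_capacity

print(non_dfs_used)      # 20035084288, the reported "Non DFS Used"
print(present_capacity)  # 25068261376, the reported "Present Capacity"
print(f"{cluster_used_pct:.2f}% {datanode_used_pct:.2f}% {datanode_remaining_pct:.2f}%")
# 92.01% 51.14% 4.44%
```

This also explains why the two "DFS Used%" values differ: they describe the same 21.48 GB against two different denominators.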
As you can see, it says that I use 21.48 GB. However, when I execute this other command, I get a total of ~11.4 GB:
[hdfs@sandbox ~]$ hdfs dfs -du -h /
0        /app-logs
181.2 M  /apps
0        /ats
9.5 G    /demo
869.1 M  /hdp
0        /mapred
0        /mr-history
269.2 M  /ranger
6.0 K    /spark-history
24.9 K   /spark2-history
8.2 K    /tmp
656.4 M  /user
[hdfs@sandbox ~]$
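The ~11.4 GB total can be reproduced by summing the listing. This is a minimal sketch that assumes the `-h` suffixes are binary units (K = 1024 bytes, and so on); `parse_size` and `total_bytes` are illustrative helper names, not part of any Hadoop tooling.

```python
# Per-directory sizes copied from the `hdfs dfs -du -h /` output above.
DU_OUTPUT = """\
0 /app-logs
181.2 M /apps
0 /ats
9.5 G /demo
869.1 M /hdp
0 /mapred
0 /mr-history
269.2 M /ranger
6.0 K /spark-history
24.9 K /spark2-history
8.2 K /tmp
656.4 M /user
"""

# Assumption: human-readable sizes use binary multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(fields):
    """Convert ['181.2', 'M'] or ['0'] into a byte count."""
    value = float(fields[0])
    unit = fields[1] if len(fields) > 1 else ""
    return value * UNITS.get(unit, 1)


def total_bytes(du_output):
    """Sum the size column of every line; the last field is the path."""
    total = 0.0
    for line in du_output.splitlines():
        *size_fields, _path = line.split()
        total += parse_size(size_fields)
    return total


total_gb = total_bytes(DU_OUTPUT) / 1024**3
reported_gb = 23066247168 / 1024**3  # "DFS Used" from dfsadmin -report
print(f"du total: {total_gb:.2f} GB, dfsadmin: {reported_gb:.2f} GB")
# du total: 11.43 GB, dfsadmin: 21.48 GB
```

The gap between the two figures is roughly 10 GB, which is space that the namespace no longer lists but that the datanode still counts as used.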
The disk usage is the same as before the deletion.
I found a topic about the same issue. However, I don't have any snapshots.
[hdfs@sandbox ~]$ hdfs lsSnapshottableDir
[hdfs@sandbox ~]$
How can I reclaim this disk space?
That's strange: when I started my virtual machine today, the disk space had been reclaimed and I got 10 GB back. I guess it reached the trash interval, which was set to 360 minutes. However, I thought emptying the trash didn't depend on this configuration.
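One possible explanation, based on the documented behaviour of `hadoop fs -expunge` rather than on anything confirmed in this thread: the command creates a new trash checkpoint and permanently deletes only checkpoints older than `fs.trash.interval`, so freshly trashed files would survive until the interval has elapsed. The sketch below models just that retention rule; `checkpoint_expired` and the timestamps are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta

# fs.trash.interval as mentioned in the post, in minutes.
TRASH_INTERVAL_MINUTES = 360


def checkpoint_expired(checkpoint_created, now, interval_minutes=TRASH_INTERVAL_MINUTES):
    """Return True once a trash checkpoint is older than the retention interval."""
    return now - checkpoint_created > timedelta(minutes=interval_minutes)


# Hypothetical timeline: a checkpoint is created when expunge is run at 13:40.
created = datetime(2017, 4, 6, 13, 40)

print(checkpoint_expired(created, created + timedelta(minutes=5)))  # False
print(checkpoint_expired(created, created + timedelta(hours=7)))    # True
```

Under this reading, the space comes back only after the checkpoint has aged past 360 minutes, which would be consistent with the space reappearing the next day.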
When running your command, I get:
[root@sandbox ~]# du -hsx * | sort -rh | head -10
368K  blueprint.json
12K   jce_policy-8.zip
8.0K  install.log
4.0K  sandbox.info
4.0K  install.log.syslog
4.0K  hdp
4.0K  build.out
4.0K  anaconda-ks.cfg
0     start_hbase.sh
0     start_ambari.sh
[root@sandbox ~]#
So I guess the non-DFS used is just reserved space.
Exactly. Non-DFS space is reserved space allocated for Hadoop services. It is not used for data storage; instead, Hadoop uses it as an intermediate layer for the various processes and jobs it runs, similar to a staging layer in typical ETL processing.
Thank you for your explanation. You seem to know HDFS pretty well, so I'll take this opportunity to ask you something else (but related).
I'm trying to write files to HDFS part by part using the WebHDFS REST API. When I define a small part size (~5 MB), I can see the remaining disk space decrease in proportion to my upload. However, the non-DFS space is also consumed during the upload, and much faster. Because of that, the non-DFS space reaches 0% and the upload stops.
After the upload, the non-DFS used space increases and reaches 18.7 GB again.
Here is some data:
File to upload: 2.2 GB / Remaining: 9.9 GB / Non-DFS used: 18.7 GB
Surprisingly, the non-DFS used space drops to 0 GB while I upload a 2.2 GB file. It doesn't decrease as much when I define a larger part size (~50 MB). Is it a cache problem? I tried setting "buffersize" in my request (matching the part size), but it doesn't seem to change anything.
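For readers unfamiliar with this kind of upload, here is a minimal sketch of how a part-by-part WebHDFS upload might be laid out: one `CREATE` request for the first part, then one `APPEND` per remaining part, each carrying the `buffersize` parameter. The NameNode address, port 50070, the user name, the target path, and the helper names `webhdfs_url` and `upload_plan` are assumptions for illustration, not details taken from the post. The sketch only builds the request URLs; a real client also has to follow the redirect to the datanode and send the data there.

```python
from urllib.parse import quote, urlencode

# Assumptions: placeholder NameNode HTTP address and user name.
NAMENODE = "http://sandbox.hortonworks.com:50070"
USER = "hdfs"


def webhdfs_url(path, op, **params):
    """Build a WebHDFS URL of the form .../webhdfs/v1/<path>?op=<OP>&..."""
    query = {"op": op, "user.name": USER}
    query.update(params)
    return f"{NAMENODE}/webhdfs/v1{quote(path)}?{urlencode(query)}"


def upload_plan(path, file_size, part_size):
    """Yield (method, url) pairs: one CREATE for the first part, then APPENDs."""
    parts = -(-file_size // part_size)  # ceiling division
    yield "PUT", webhdfs_url(path, "CREATE", overwrite="true", buffersize=part_size)
    for _ in range(parts - 1):
        yield "POST", webhdfs_url(path, "APPEND", buffersize=part_size)


# A 2.2 GB file sent in 5 MB parts, as described in the post.
plan = list(upload_plan("/tmp/upload.bin", int(2.2 * 1024**3), 5 * 1024**2))
print(len(plan))  # 451
print(plan[0])
print(plan[1])
```

With 50 MB parts the same file needs 46 requests instead of 451, which is the only difference between the two scenarios as far as the request sequence is concerned.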
Non-DFS space is flushed when more memory is needed to process jobs. There is also a cleanup interval set in the configuration; depending on it, the memory is cleared and reused by Hadoop. In your case, processing 50 MB costs much less memory than processing 2.2 GB, because memory is allocated and de-allocated based on the size of the file being processed. I hope that answers your question.
The link below might give some insight into memory allocation.
I've finally solved the problem by using the blocksize parameter in the HTTP request. With the blocksize set to a lower value, the system no longer overloads. I guess it was because the system created temporary 64 MB blocks holding only 5 MB of data each. After a while, the non-DFS space was exhausted and no more temporary blocks could be created.
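The guess above can be put into numbers. The sketch below is only an upper bound under that hypothesis (every part temporarily holds a full block, and none is released before the upload ends); it is not a description of how HDFS actually accounts for space. The 8 MB block size is an arbitrary example of "a lower value", and `worst_case_reservation` is a hypothetical helper name.

```python
MB = 1024**2
GB = 1024**3

file_size = int(2.2 * GB)  # file to upload, from the post
remaining = int(9.9 * GB)  # DFS remaining before the upload, from the post


def worst_case_reservation(part_size, block_size):
    """Upper bound on reserved space if every part held a full block at once."""
    parts = -(-file_size // part_size)  # ceiling division
    return parts * block_size


scenarios = [
    ("5 MB parts, 64 MB blocks", 5 * MB, 64 * MB),
    ("50 MB parts, 64 MB blocks", 50 * MB, 64 * MB),
    ("5 MB parts, 8 MB blocks", 5 * MB, 8 * MB),  # hypothetical lower blocksize
]
for label, part_size, block_size in scenarios:
    reserved = worst_case_reservation(part_size, block_size)
    fits = "fits" if reserved <= remaining else "exceeds remaining space"
    print(f"{label}: {reserved / GB:.1f} GB ({fits})")
# 5 MB parts, 64 MB blocks: 28.2 GB (exceeds remaining space)
# 50 MB parts, 64 MB blocks: 2.9 GB (fits)
# 5 MB parts, 8 MB blocks: 3.5 GB (fits)
```

Under these assumptions, only the first scenario exceeds the 9.9 GB that was available, which matches the observation that small parts with the default block size stalled while larger parts or a smaller block size did not.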
I hope I'm clear enough.
Glad you solved it. Be careful when choosing the block size: if you are going to deal with large chunks of data, it is better to choose a higher value. But if you are just getting to know HDFS with small amounts of data, a smaller block size won't affect the process.