Removing files in HDFS does not free up space

Hi,

I'm running the sandbox on a VirtualBox virtual machine. It is a single-node cluster with a replication factor of 1.

After deleting files in the Hadoop file system and removing them from the trash, I don't get disk space back even after waiting for a while. I tried to use:

[hdfs@sandbox ~]$ hadoop fs -expunge                                                                                                                                                                               
[hdfs@sandbox ~]$                                                                                                                                                                                                  

When I use hdfs dfsadmin -report, I get:

[hdfs@sandbox ~]$ hdfs dfsadmin -report                                                                                                                                                                            
Configured Capacity: 45103345664 (42.01 GB)                                                                                                                                                                        
Present Capacity: 25068261376 (23.35 GB)                                                                                                                                                                           
DFS Remaining: 2002014208 (1.86 GB)                                                                                                                                                                                
DFS Used: 23066247168 (21.48 GB)                                                                                                                                                                                   
DFS Used%: 92.01%                                                                                                                                                                                                  
Under replicated blocks: 70                                                                                                                                                                                        
Blocks with corrupt replicas: 0                                                                                                                                                                                    
Missing blocks: 0                                                                                                                                                                                                  
Missing blocks (with replication factor 1): 0                                                                                                                                                                      
                                                                                                                                                                                                                   
-------------------------------------------------                                                                                                                                                                  
Live datanodes (1):                                                                                                                                                                                                
                                                                                                                                                                                                                   
Name: 172.17.0.2:50010 (sandbox.hortonworks.com)                                                                                                                                                                   
Hostname: sandbox.hortonworks.com                                                                                                                                                                                  
Decommission Status : Normal                                                                                                                                                                                       
Configured Capacity: 45103345664 (42.01 GB)                                                                                                                                                                        
DFS Used: 23066247168 (21.48 GB)                                                                                                                                                                                   
Non DFS Used: 20035084288 (18.66 GB)                                                                                                                                                                               
DFS Remaining: 2002014208 (1.86 GB)                                                                                                                                                                                
DFS Used%: 51.14%                                                                                                                                                                                                  
DFS Remaining%: 4.44%                                                                                                                                                                                              
Configured Cache Capacity: 0 (0 B)                                                                                                                                                                                 
Cache Used: 0 (0 B)                                                                                                                                                                                                
Cache Remaining: 0 (0 B)                                                                                                                                                                                           
Cache Used%: 100.00%                                                                                                                                                                                               
Cache Remaining%: 0.00%                                                                                                                                                                                            
Xceivers: 4                                                                                                                                                                                                        
Last contact: Thu Apr 06 13:36:57 UTC 2017                                                                                                                                                                                                                                                                                                                                                  

As you can see, it reports 21.48 GB used. However, the following command gives a total of about 11.4 GB:

[hdfs@sandbox ~]$ hdfs dfs -du -h /                                                                                                                                                                                
0        /app-logs                                                                                                                                                                                                 
181.2 M  /apps                                                                                                                                                                                                     
0        /ats                                                                                                                                                                                                      
9.5 G    /demo                                                                                                                                                                                                     
869.1 M  /hdp                                                                                                                                                                                                      
0        /mapred                                                                                                                                                                                                   
0        /mr-history                                                                                                                                                                                               
269.2 M  /ranger                                                                                                                                                                                                   
6.0 K    /spark-history                                                                                                                                                                                            
24.9 K   /spark2-history                                                                                                                                                                                           
8.2 K    /tmp                                                                                                                                                                                                      
656.4 M  /user                                                                                                                                                                                                     
[hdfs@sandbox ~]$                                                                                                                                                                                                  

The disk usage is the same as before the deletion.
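
The figures in the report and in the `du` listing can be cross-checked with a little arithmetic. The following is only a sketch in Python: the numbers are copied from the two outputs in this post, the variable names are made up for illustration, and the `du` sizes are assumed to use 1024-based units. It checks that, in this report, Configured Capacity equals DFS Used plus Non DFS Used plus DFS Remaining, and compares DFS Used with the sum of the directory sizes.

```python
# Values copied from the `hdfs dfsadmin -report` output above (bytes).
configured_capacity = 45103345664
present_capacity = 25068261376
dfs_used = 23066247168
non_dfs_used = 20035084288
dfs_remaining = 2002014208

# In this report the three usage figures add up to the configured capacity,
# and Present Capacity is what remains once Non DFS Used is subtracted.
capacity_adds_up = dfs_used + non_dfs_used + dfs_remaining == configured_capacity
present_adds_up = configured_capacity - non_dfs_used == present_capacity

# The cluster-wide "DFS Used%" is relative to Present Capacity,
# the per-datanode one is relative to Configured Capacity.
cluster_used_pct = round(100 * dfs_used / present_capacity, 2)
datanode_used_pct = round(100 * dfs_used / configured_capacity, 2)

# Sizes copied from `hdfs dfs -du -h /` (assumption: K/M/G are 1024-based).
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
du_sizes = ["0", "181.2 M", "0", "9.5 G", "869.1 M", "0", "0",
            "269.2 M", "6.0 K", "24.9 K", "8.2 K", "656.4 M"]


def to_bytes(text):
    """Convert a size such as '181.2 M' or '0' into a number of bytes."""
    parts = text.split()
    unit = parts[1] if len(parts) > 1 else ""
    return float(parts[0]) * UNITS[unit]


du_total_gb = sum(to_bytes(size) for size in du_sizes) / 1024**3
gap_gb = dfs_used / 1024**3 - du_total_gb

print(capacity_adds_up, present_adds_up, cluster_used_pct, datanode_used_pct)
print(round(du_total_gb, 2), round(gap_gb, 2))
```

The gap between the two totals is the used DFS space that the directory listing does not account for, which is the space in question here.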

I found a topic about the same issue. However, I don't have any snapshots.

[hdfs@sandbox ~]$ hdfs lsSnapshottableDir                                                                                                                                                                          
                                                                                                                                                                                                                   
[hdfs@sandbox ~]$                                                                                                                                                                                                  

How can I reclaim this disk space?

Hi @Bala Vignesh N V

That's strange: when I started my virtual machine today, the disk space had been reclaimed and I got 10 GB back. I guess the trash interval, which was set to 360 minutes, had elapsed. However, I thought emptying the trash didn't depend on this setting.
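
A possible explanation, based on how the Hadoop shell documentation describes `hadoop fs -expunge`: the command creates a new trash checkpoint and permanently deletes only checkpoints older than the retention threshold (`fs.trash.interval`), so space may come back only after the interval has passed. The sketch below is a simplified Python model of that timing, not Hadoop's actual code; the function name and the checkpoint time are hypothetical, and 360 is the interval quoted above.

```python
from datetime import datetime, timedelta


def checkpoint_expiry(checkpoint_time, trash_interval_minutes):
    """Earliest time at which a trash checkpoint is older than the interval."""
    return checkpoint_time + timedelta(minutes=trash_interval_minutes)


# Hypothetical checkpoint time; 360 minutes is the interval from the post.
created = datetime(2017, 4, 6, 13, 0)
expires = checkpoint_expiry(created, 360)
print(expires)
```

Under this model, a checkpoint created at 13:00 would not be eligible for deletion before 19:00 the same day.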

When running your command I get:

[root@sandbox ~]# du -hsx * | sort -rh | head -10                                                                                                                                                                  
368K    blueprint.json
12K     jce_policy-8.zip
8.0K    install.log
4.0K    sandbox.info
4.0K    install.log.syslog
4.0K    hdp
4.0K    build.out
4.0K    anaconda-ks.cfg
0       start_hbase.sh
0       start_ambari.sh
[root@sandbox ~]# 

So I guess the non-DFS used is just reserved space.

@Jonathan Samelson

Exactly. Non-DFS is reserved space allocated for Hadoop services. It is not used for data storage, but Hadoop uses it as an intermediate layer for the various processes and jobs that it triggers, similar to a staging layer in typical ETL processing.

@Bala Vignesh N V

Thank you for your explanation. You seem to know HDFS pretty well, so I'll take this opportunity to ask you something else (but related).

I'm trying to write files to HDFS part by part using the WebHDFS REST API. When I define a small part (~5 MB), I can see the remaining disk space decrease in line with my upload. However, the non-DFS space is also consumed during the upload, and much faster. Because of that, the non-DFS space reaches 0% and the upload stops.

After the upload, the non-DFS used increases and reaches 18.7 GB again...

Here are some figures:

File to upload: 2.2 GB / Remaining: 9.9 GB / Non-DFS used: 18.7 GB

Surprisingly, the non-DFS used reaches 0 GB while I upload a 2.2 GB file. It doesn't decrease as much when I define a larger part (~50 MB). Is it a cache problem? I tried to use the "buffersize" parameter in my request (set to the part size), but it doesn't seem to change anything.

@Jonathan Samelson

Non-DFS space will be flushed out if more memory is needed to process jobs. There is also a cleanup interval set in the configuration. Depending on that, the memory will be cleared and used by Hadoop. In your case, processing 50 MB doesn't cost as much memory as processing 2.2 GB. Memory is allocated and de-allocated based on the size of the file being processed. I hope this answers your question.

The link below might give some insight into memory allocation.

https://books.google.co.in/books?id=H3mvcxPeUfwC&pg=PA114&lpg=PA114&dq=dfs.datanode.du.reserved+exam...

Hi @Bala Vignesh N V

I've finally solved the problem by using the blocksize parameter in the HTTP request. With the blocksize set to a lower value, the system no longer overloads. I guess it was because the system temporarily created 64 MB blocks for 5 MB of data. After a while, the non-DFS space was exhausted and no more temporary blocks could be created.

I hope I'm clear enough.
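
For readers who want to try the same fix, here is a rough sketch of what such requests might look like, written in Python. It only builds the URLs and sends nothing. The `op=CREATE` and `op=APPEND` operations and the `blocksize` and `overwrite` query parameters come from the WebHDFS REST API. The host, port 50070, file path, helper names and the 8 MiB value are assumptions for illustration, and the post does not say which operation carried the parameter, so `CREATE` is a guess.

```python
from urllib.parse import urlencode


def webhdfs_create_url(host, port, path, blocksize=None, overwrite=False):
    """Build the URL for a WebHDFS op=CREATE request (sent with HTTP PUT)."""
    params = {"op": "CREATE", "overwrite": str(overwrite).lower()}
    if blocksize is not None:
        params["blocksize"] = blocksize  # block size in bytes
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode(params)}"


def webhdfs_append_url(host, port, path):
    """Build the URL for a WebHDFS op=APPEND request (sent with HTTP POST)."""
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode({'op': 'APPEND'})}"


# Hypothetical host, port and path; 8 MiB is an arbitrary example block size.
create_url = webhdfs_create_url("sandbox.hortonworks.com", 50070,
                                "/user/demo/big.bin", blocksize=8 * 1024 * 1024)
append_url = webhdfs_append_url("sandbox.hortonworks.com", 50070,
                                "/user/demo/big.bin")
print(create_url)
print(append_url)
```

In a part-by-part upload, the first part would go to the `CREATE` URL and the following parts to the `APPEND` URL. WebHDFS normally answers each of these requests with a redirect to a DataNode, and the data is then sent to that redirect location.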

@Jonathan Samelson

Glad you solved it. Be careful when choosing the block size: if you are going to deal with large chunks of data, it's better to choose a higher value. But if you are just getting to know HDFS with a small amount of data, a smaller block size won't affect the process.