
HDFS disk usage question

Super Collaborator

Why is the command below showing the available space so low when the disk is hardly used?

[hdfs@hadoop1 ~]$ hdfs dfs -df -h /
Filesystem           Size     Used   Available  Use%
hdfs://hadoop1:8020  240.5 G  3.6 G  150.8 G    2%

1 ACCEPTED SOLUTION

Super Guru

@Sami Ahmad

The output you posted shows 3.6 GB used and 150.8 GB available. Your concern is that the available space should be 240.5 GB - 3.6 GB = 236.9 GB, not 150.8 GB.

Here is how it goes. HDFS stores data in blocks, and each block has a configured size; let's assume 128 MB per block. If you have many small files, they underuse the block size. For example, ten files of 64 MB each, stored in ten 128 MB blocks, underuse their blocks by 50%. The remaining space in those blocks CANNOT be used by other files. It is simply wasted, and it is not reported as AVAILABLE.

The way the hdfs dfs -df -h command computes AVAILABLE is this: it determines the number of blocks still empty and available for new data and multiplies that count by the block size. Looking at your numbers above, the wasted space is 236.9 - 150.8 = 86.1 GB. That suggests your block size is set to a value higher than your average file size, by about 50%. This is not uncommon, but be aware of it.
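Just to make that arithmetic explicit (a quick sanity check with bc, using the GB figures from your df output):

$ echo "240.5 - 3.6" | bc
236.9
$ echo "236.9 - 150.8" | bc
86.1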

I hope this explanation is good enough. Please don't forget to vote for and accept the answer that resolved your question.


4 REPLIES

Super Guru

@Sami Ahmad

Can you share the output of the command below?

$ hdfs dfsadmin -report

Super Collaborator

[sqoop@hadoop1 ~]$ hdfs dfsadmin -report
Configured Capacity: 258183639040 (240.45 GB)
Present Capacity: 165690710832 (154.31 GB)
DFS Remaining: 161715403644 (150.61 GB)
DFS Used: 3975307188 (3.70 GB)
DFS Used%: 2.40%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0


Contributor

Unfortunately, the given solution is not correct.

In HDFS, a block that is open for write does consume the full block size (128 MB in this example), but as soon as the file is closed, the last block of the file is accounted for only by the actual length of the file.

So if you have a 1 KB file, it consumes 3 KB of disk space with replication factor 3, and a 129 MB file consumes 387 MB of disk space, again with replication factor 3.
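On newer Hadoop releases, hdfs dfs -du reports both the file length and the disk space consumed across all replicas, so you can verify this accounting yourself. A sketch with the 1 KB test file from the demonstration below (output is illustrative, assuming replication factor 3; older releases print only the length and path):

$ hdfs dfs -du /tmp/test.txt
1024  3072  /tmp/test.txt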


The phenomenon seen in the output was most likely caused by non-DFS disk usage, which reduces the disk space available to HDFS, and has nothing to do with file sizes.
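For the numbers in this thread, the lost space can be read straight out of the hdfs dfsadmin -report output posted above: non-DFS usage (plus any reserved space) is the Configured Capacity minus the Present Capacity. Using the byte values from that report:

$ echo "258183639040 - 165690710832" | bc
92492928208

That is roughly 86.1 GB, which matches the gap between the expected 236.9 GB and the reported 150.8 GB available.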

Just to demonstrate this with a 1 KB test file:

# hdfs dfs -df -h 
Filesystem Size Used Available Use%
hdfs://<nn>:8020 27.1 T 120 K 27.1 T 0%
# fallocate -l 1024 test.txt
# hdfs dfs -put test.txt /tmp
# hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://<nn>:8020 27.1 T 123.0 K 27.1 T 0%

I hope this helps to clarify and correct this answer.