
HDFS storage check shows different values

Master Collaborator

hdfs dfs -du -h -s /
221.1 T  637.9 T  /

hdfs dfs -du -h -s .
204.2 M  1.2 G  .

But in the UI I see it's 670 TB.

 

I'm sure I'm missing something but can't find it.

 

Configured Capacity: 1.02 PB
DFS Used: 670.54 TB
Non DFS Used: 283.37 GB
DFS Remaining: 368.96 TB
DFS Used%: 64.49%
DFS Remaining%: 35.48%
Block Pool Used: 670.54 TB
6 REPLIES

Champion

Could you run the commands below and post the results?

I am curious, what is your replication factor?

hadoop fsck /path/to/directory

hadoop fs -du -s /path/to/directory

The above commands should give us the same results.

Both only calculate the raw HDFS data, without considering the replication factor.
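
As a rough sketch of what to compare (the path is a placeholder, and the exact output layout can differ slightly between Hadoop versions):

hadoop fsck /path/to/directory | grep "Total size"   # fsck reports the logical (single-replica) size
hadoop fs -du -s /path/to/directory                  # prints: <logical size> <size including replication> <path>

The first figure printed by -du -s should line up with fsck's "Total size" line.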

 

The command below will calculate the file size across the nodes (i.e. on disk, including the replication factor):

hadoop fs -count -q /path/to/directory

We can then compare how much HDFS space these results say has been consumed against the NameNode UI figures.
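
For reference, and going from memory, the columns of the -count -q output are laid out like this:

hadoop fs -count -q /path/to/directory
# QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
# CONTENT_SIZE is the logical size; the two space-quota columns are expressed in raw bytes,
# i.e. they do take replication into account.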

 

Champion
The du switch gets the size of the given directory. The first number is the single-replica size and the second number is the size at the full replication factor.
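
A quick sketch using the figures from the first post: dividing the second number by the first gives the effective (average) replication actually in use.

awk 'BEGIN { printf "%.2f\n", 637.9 / 221.1 }'   # ≈ 2.89, a bit below the default factor of 3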

The UI, and even CM, do a different calculation, and it is annoying because it isn't what I would call accurate. In the last few days I saw a JIRA related to it, about how Non-DFS and the reserved space are used in the calculation.

I don't have the current calculation in front of me, but it is different. It is obvious when you tally up the space used (including non-DFS), the space unused, and even the percentages: it will never equal 100%, and it will never equate to your raw disk availability.

I may get this wrong, but it is related to the amount you have reserved for non-DFS data. That is lopped off the configured capacity, but then the system also uses it to calculate the non-DFS used in a weird way that always says more is used than there actually is.
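
If I remember right (and the exact logic has changed between versions, which is what that JIRA is about), the report derives its numbers roughly like this; the labels below are the UI labels, not actual code:

# Configured Capacity = raw capacity of the DataNode data dirs - reserved space
# Non DFS Used        = Configured Capacity - DFS Used - DFS Remaining
# Sanity check against the UI values above (binary units assumed):
awk 'BEGIN { printf "%.2f TB\n", 670.54 + 283.37/1024 + 368.96 }'
# ≈ 1039.78 TB ≈ 1.015 PB, which the UI rounds to the reported 1.02 PB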

Master Collaborator

Total size: 253714473531851 B (Total open files size: 11409372739 B)
Total dirs: 1028908
Total files: 7639121
Total symlinks: 0 (Files currently being written: 107)
Total blocks (validated): 8781147 (avg. block size 28893090 B) (Total open file blocks (not validated): 149)
Minimally replicated blocks: 8781147 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.8528664
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 30
Number of racks: 1
FSCK ended at Mon Feb 20 21:33:23 EST 2017 in 190136 milliseconds

The filesystem under path '/' is HEALTHY


hadoop fs -du -s /
244412682417174 708603392967605 /

hadoop fs -count -q /
9223372036854775807 9223372036846392726 none inf 987886 7395195 244417466380498 /
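
To put those byte counts into the same units the UI uses (assuming du -h and the UI both use binary prefixes, so "T" means TiB, i.e. 1024^4 bytes):

awk 'BEGIN {
  t = 1024 ^ 4                                                        # bytes per TiB
  printf "fsck Total size        : %.1f TiB\n", 253714473531851 / t  # ≈ 230.8 TiB (single replica)
  printf "du logical size        : %.1f TiB\n", 244412682417174 / t  # ≈ 222.3 TiB (single replica)
  printf "du size w/ replication : %.1f TiB\n", 708603392967605 / t  # ≈ 644.5 TiB, vs 670.54 TB DFS Used in the UI
}'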

 

The non-HDFS reserved space is 10 GB, and with 30 nodes it should not exceed 1 TB even with a replication factor of 3.
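
One detail worth double-checking (this is an assumption about how the cluster is configured): dfs.datanode.du.reserved is applied per data volume, not per node, so the cluster-wide reserved total scales with the number of disks per DataNode, for example:

# 10 GB reserved x 12 data disks per node (a guess) x 30 nodes
awk 'BEGIN { printf "%d GB\n", 10 * 12 * 30 }'   # = 3600 GB, roughly 3.5 TB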

 

It's really annoying.

 

 

Master Collaborator

So I shouldn't search for the missing 40 TB, and the right storage figure is what fsck shows?

Master Collaborator

It is becoming really annoying, since the difference between the UI (or hdfs dfsadmin -report) and hdfs dfs -du -h -s is now 150 TB. I deleted all the HDFS snapshots and disallowed them, but I still get the same results.
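
For what it's worth, two commands that can confirm snapshots really are out of the picture (paths are placeholders; listing .snapshot requires read access on the directory):

hdfs lsSnapshottableDir                    # directories that still allow snapshots
hdfs dfs -ls /path/to/dir/.snapshot        # any snapshots still present under a snapshottable directory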

Master Collaborator

I figured out the issue.

 

The difference comes from /tmp/logs.

 

It's weird that hdfs dfs -du -h -s / is not considering /tmp/logs.
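
For anyone hitting the same thing later, drilling down level by level (instead of only looking at the root total) is a quick way to find where the space went:

hdfs dfs -du -h /                # per-directory usage at the top level
hdfs dfs -du -h /tmp             # drill into whichever directory looks too big
hadoop fs -count -q /tmp/logs    # directory/file counts and logical size for the culprit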