HDFS file and block size

Contributor

I got the details below from hadoop fsck /:

Total size: 41514639144544 B (Total open files size: 581 B)

Total dirs: 40524

Total files: 124348

Total symlinks: 0 (Files currently being written: 7)

Total blocks (validated): 340802 (avg. block size 121814540 B) (Total open file blocks (not validated): 7)

Minimally replicated blocks: 340802 (100.0 %)

 

I am using a 256 MB block size. So 340802 blocks * 256 MB = 83.2 TB, and 83.2 TB * 3 (replicas) = 249.6 TB, but Cloudera Manager shows 110 TB of disk used. How is that possible?

 

Does this mean that even though the block size is 256 MB, a small file doesn't use the whole block for itself?

Mentor
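Yes. Blocks are not pre-allocated; they are a logical unit of division, so a file smaller than the block size only takes up as much disk space as its actual length. See the Hadoop FAQ:
https://wiki.apache.org/hadoop/FAQ#If_a_block_size_of_64MB_is_used_and_a_file_is_written_that_uses_l...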
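
To make the arithmetic concrete, here is a minimal sketch that redoes the calculation from the question with the fsck figures. It assumes a replication factor of 3 on every file and binary units (1 TiB = 1024^4 bytes); the variable names are only illustrative. The estimate based on full 256 MB blocks comes out at about 249.6 TiB, while the estimate based on the actual "Total size" comes out at about 113 TiB, which is close to the 110 TB shown in Cloudera Manager.

```python
# Figures copied from the fsck output in the question.
TOTAL_SIZE_BYTES = 41_514_639_144_544   # "Total size"
TOTAL_BLOCKS = 340_802                  # "Total blocks (validated)"

# Assumptions: replication factor 3 for all files, 256 MiB configured block size.
REPLICATION = 3
BLOCK_SIZE_BYTES = 256 * 1024**2
TIB = 1024**4


def to_tib(n_bytes):
    """Convert a byte count to tebibytes."""
    return n_bytes / TIB


# Average block size, as fsck reports it (integer division).
avg_block_bytes = TOTAL_SIZE_BYTES // TOTAL_BLOCKS

# Estimate from the question: every block counted as a full 256 MiB.
full_block_estimate = TOTAL_BLOCKS * BLOCK_SIZE_BYTES * REPLICATION

# Estimate from actual data length: blocks only hold the bytes written to them.
actual_estimate = TOTAL_SIZE_BYTES * REPLICATION

print(f"average block size : {avg_block_bytes} B")
print(f"full-block estimate: {to_tib(full_block_estimate):.1f} TiB")
print(f"actual-size estimate: {to_tib(actual_estimate):.1f} TiB")
```

The average block size of roughly 122 MB, less than half of the configured 256 MB, is what accounts for the gap between the two estimates.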

Contributor
I have 194945 files that are smaller than 50 MB, and together they occupy 884 GB of space. 1) How do I calculate the space these files will occupy if I hadoop archive them? 2) Am I using HDFS efficiently, i.e. am I not wasting any space even though there are small files? 3) Does archiving really save disk space, or does it just reduce the namespace overhead? Harsh, can you give me a detailed picture of this?

Contributor
So even if I archive these files, I won't be saving any disk space, is that right?

Master Collaborator

You won't save HDFS filesystem space by "archiving" or "combining" small files. In many scenarios you will get a performance boost from combining. Combining also reduces the metadata overhead on the NameNode.
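
As a rough illustration of the metadata point, the sketch below counts the NameNode objects (one per file plus one per block) for the 194945 small files mentioned in the thread, and again after hypothetically combining the same data into files of about one block each. Several things here are assumptions, not facts from the thread: "884 GB" is read as GiB, each small file is taken to occupy exactly one block, and the figure of about 150 bytes of NameNode heap per object is only a commonly quoted rule of thumb. It models generic combining, not the exact layout of a Hadoop archive, which also stores index files.

```python
import math

# Figures from the thread.
SMALL_FILES = 194_945
DATA_BYTES = 884 * 1024**3        # "884 GB", read here as GiB (assumption)
BLOCK_SIZE_BYTES = 256 * 1024**2  # 256 MiB block size from the question

# Rule-of-thumb heap cost per NameNode object (assumption, not a measured value).
BYTES_PER_OBJECT = 150


def namenode_objects(num_files, blocks_per_file):
    """Count one object per file plus one per block of each file."""
    return num_files * (1 + blocks_per_file)


# Before: every file is under 50 MB, so each fits in a single block.
objects_before = namenode_objects(SMALL_FILES, 1)

# After: hypothetically pack the same bytes into files of one full block each.
combined_files = math.ceil(DATA_BYTES / BLOCK_SIZE_BYTES)
objects_after = namenode_objects(combined_files, 1)

for label, count in (("before", objects_before), ("after", objects_after)):
    heap_mib = count * BYTES_PER_OBJECT / 1024**2
    print(f"{label}: {count} objects, roughly {heap_mib:.1f} MiB of heap")

# The data itself is unchanged: the same DATA_BYTES are stored either way.
print(f"data stored: {DATA_BYTES / 1024**3:.0f} GiB in both cases")
```

Under these assumptions the object count drops from roughly 390,000 to about 7,000, while the amount of data on disk stays the same.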