How can we define block size to manage HDFS storage & processing efficiently?


@Geoffrey Shelton Okot

I have a cluster with 220 million files, of which 110 million are less than 1 MB in size.

The default block size is set to 128 MB.

What should the block size be for files less than 1 MB? And how can we set it on a live cluster?

Total Files + Directories: 227,008,030

Disk Remaining: 700 TB / 3.5 PB (20%)
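
One point worth stating up front (a general HDFS fact, not from this thread): block size is a per-file, write-time property, and a file smaller than its block size occupies only its actual length on disk, so the 128 MB default does not pre-allocate space for small files. Changing dfs.blocksize in hdfs-site.xml only affects files written afterwards; existing files keep the block size they were created with. Below is a minimal sketch of writing a single file with an explicit block size through the Java FileSystem API; the path and the 16 MB value are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeAtWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size is fixed per file at write time; this does not touch
        // the cluster-wide dfs.blocksize default. The path and the 16 MB
        // value are hypothetical, for illustration only.
        long blockSize = 16L * 1024 * 1024;
        short replication = 3;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"),
                true, bufferSize, replication, blockSize)) {
            out.writeBytes("payload");
        }
        fs.close();
    }
}

Note also that the NameNode rejects per-file block sizes below dfs.namenode.fs-limits.min-block-size (1 MB by default).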

2 REPLIES

Hi @SP

Lots of small files damage cluster health regardless of block size: every file, directory, and block consumes NameNode heap (roughly 150 bytes per object, as a rule of thumb), so ~227 million objects already tie up tens of gigabytes of heap.

Before changing the block size, check whether the files can be combined. If there are similar sets of files that can be merged, do that first so that the resulting file sizes become significant.

Also, does the cluster hold only these small files (<1 MB)? If so, then it is worth thinking about changing the block size. But if you also have big files that span multiple blocks, then instead of changing the block size you should think about combining the small files, as I mentioned earlier. Or, if you have separate tiers for hot/warm/cold data and these files belong to the cold tier, you could reduce the block size there, but that defeats the aim of HDFS, which works best as a distributed system over large blocks. Also, if the block size is reduced, you may need to touch other configuration parameters such as mapper size, reducer size, and input split size. A consolidation sketch follows below.
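
As a hedged illustration of the combining approach (a common pattern, not something prescribed in this thread): pack small files into a SequenceFile, keyed by the original path so each file stays addressable. The input directory and container path below are hypothetical.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/data/small-files"); // hypothetical
        Path container = new Path("/data/packed.seq"); // hypothetical

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(container),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                // Safe to buffer whole files here because they are <1 MB.
                byte[] buf = new byte[(int) status.getLen()];
                try (InputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, buf, 0, buf.length);
                }
                // The original path becomes the key, so files stay addressable.
                writer.append(new Text(status.getPath().toString()),
                        new BytesWritable(buf));
            }
        }
    }
}

Packing N small files into one container reduces NameNode metadata from N file objects to one, at the cost of needing key-based lookup to read an individual file back.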


@Bala Vignesh N V

Thanks for your reply. Please see the file counts by size range below. As for combining files (e.g., a HAR implementation), that will be very difficult because the data is scattered across many hierarchical directories, and I cannot find any automated script or tool to do it, other than the manual option. The manual route would also require moving data, which would hurt cluster performance. And there is no option for hot/warm/cold splitting, as it is a single production cluster.

110 million < 1 MB
19 million => 1 MB to 32 MB
2.5 million => 32 MB to 64 MB
0.9 million => 64 MB to 128 MB
1 million > 128 MB
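
For reference, a minimal sketch of how such a size histogram could be produced with the FileSystem API; the root path and the full recursive listing are assumptions, and on ~227 million objects an offline analysis of the fsimage (e.g., via hdfs oiv) would be far cheaper than a live scan.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FileSizeHistogram {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long MB = 1024L * 1024;
        long[] bounds = {1 * MB, 32 * MB, 64 * MB, 128 * MB};
        long[] counts = new long[bounds.length + 1];

        // Recursively walk the namespace and bucket every file by size.
        // Scanning from "/" is an assumption; narrow the path in practice.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            long len = it.next().getLen();
            int bucket = 0;
            while (bucket < bounds.length && len >= bounds[bucket]) {
                bucket++;
            }
            counts[bucket]++;
        }
        System.out.printf("<1MB: %d, 1-32MB: %d, 32-64MB: %d, 64-128MB: %d, >128MB: %d%n",
                counts[0], counts[1], counts[2], counts[3], counts[4]);
    }
}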

Let me know your thoughts.
