
What is the small file problem?

What is the small file problem in Hadoop HDFS?



A small file is typically one smaller than the default HDFS block size (64 MB). Storing large numbers of small files can degrade the performance of the entire Hadoop cluster.

Problems caused by small files in HDFS:
First, if we store a lot of small files, the NameNode must hold all of their metadata in memory. The NameNode represents each file, directory, and block as an in-memory object of roughly 150 bytes, so millions of small files mean millions of extra objects, consuming NameNode heap and slowing the whole system down.
Secondly, when a client reads data spread across many small blocks, only part of the data lives in any one block; the client has to go back to the NameNode for the location of each subsequent block to read the rest. These extra round trips slow down the read path and hurt performance.
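To put rough numbers on both points, here is a back-of-the-envelope sketch. The ~150-bytes-per-object figure is the commonly cited approximation (real heap usage varies by Hadoop version), and the one-lookup-per-block read model is simplified (a real client batches location requests per file), but the trend holds:

```python
import math

BYTES_PER_OBJECT = 150  # commonly cited approximation per NameNode object

def namenode_metadata_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

def block_location_lookups(total_mb, block_mb):
    # Simplified read model: one NameNode location lookup per block.
    return math.ceil(total_mb / block_mb)

# 10 million one-block small files vs. the same data packed into
# 100,000 ideally sized files:
print(f"{namenode_metadata_bytes(10_000_000) / 1024**3:.1f} GiB of NameNode heap")
print(f"{namenode_metadata_bytes(100_000) / 1024**2:.1f} MiB of NameNode heap")

# Reading 1 GiB of data:
print(block_location_lookups(1024, 64))  # 16 lookups with 64 MB blocks
print(block_location_lookups(1024, 1))   # 1024 lookups with 1 MB files
```

The hundredfold difference in NameNode heap, for the same amount of user data, is exactly the metadata pressure described above.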

Problems caused by small files in MapReduce:
In the MapReduce design, a map task processes a single block at a time, and each task starts its own JVM. If there are millions of small blocks, millions of map tasks must be launched, and the per-task startup overhead can dominate the total processing time. If instead each map task processes an ideally sized block (64 MB), MapReduce delivers close to optimum performance.
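A quick calculation shows why millions of map tasks hurt: total launch overhead scales with the number of input files. This is a sketch; the 3-second per-task JVM startup and scheduling cost is an illustrative assumption, not a measured value:

```python
def mapreduce_overhead_seconds(total_gb, file_mb, task_overhead_s=3.0):
    # One map task per file/block; task_overhead_s models JVM startup
    # plus scheduling cost per task (an assumed figure for illustration).
    num_tasks = (total_gb * 1024) // file_mb
    return num_tasks * task_overhead_s

# Processing 100 GB of input:
print(mapreduce_overhead_seconds(100, 1))   # 102,400 map tasks of overhead
print(mapreduce_overhead_seconds(100, 64))  # only 1,600 map tasks
```

With 1 MB files the cluster spends roughly 64 times as long on task launch overhead as it would with 64 MB blocks, before any useful work is done.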