Support Questions



What is the small file problem? If I store 1 million small files in HDFS, will there be any issue?



@Himani Bansal

Storing a million small files in HDFS is possible, but HDFS is not geared toward efficient access of small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

Problems with small files and MapReduce

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 64 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file.
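To make the comparison above concrete, here is a rough back-of-the-envelope sketch (plain Python, not Hadoop code) that counts map tasks assuming one task per block, with single-block files each getting their own task:

```python
# Rough illustration of the map-task counts described above,
# assuming one map task per HDFS block (the FileInputFormat default).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size

def map_tasks(total_bytes, file_size):
    """Map tasks needed when total_bytes is stored as files of file_size each."""
    files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return files * blocks_per_file

one_gb = 1024 * 1024 * 1024
print(map_tasks(one_gb, one_gb))      # one 1 GB file -> 16 map tasks
print(map_tasks(one_gb, 100 * 1024))  # ~10,000 x 100 KB files -> ~10,485 map tasks
```

The same gigabyte of data goes from 16 map tasks to roughly 10,000, and each of those tasks pays JVM startup and scheduling overhead.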

There are a couple of features that help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit, which can run more than one split per map.
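As a sketch, JVM reuse would be enabled in mapred-site.xml along these lines (MRv1-era property name as referenced above; the value shown is illustrative, not a recommendation):

```xml
<!-- mapred-site.xml: reuse one JVM for multiple map tasks of a job -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- Run up to 10 tasks per JVM; -1 means reuse without limit -->
  <value>10</value>
</property>
```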

In general, Hadoop handles big files very well, but when the files are small, it just passes each small file to a map() function, which is not very efficient because it creates a large number of mappers. For example, 1,000 files of 2 to 3 MB each will need 1,000 mappers, which is very inefficient.



There are also latency issues with datanodes: when the datanodes report heartbeats to the namenode with their block counts, a lot of namenode effort (in terms of resources such as heap memory) goes into updating the metadata about the files, since there are so many of them. We faced this in production, where a small-files problem in the cluster caused datanodes to go down intermittently and come back up. We also increased the namenode heap size to fix the issue.

Small files are those which are significantly smaller than the default HDFS block size, i.e., 64 MB. HDFS can't handle these small files efficiently.

If you store 1 million files on HDFS, they will consume a lot of namenode memory for storing the metadata of the files, which will make processing very slow.
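The namenode-memory point can be sketched with the commonly cited rule of thumb that each namespace object (file or block) takes roughly 150 bytes of namenode heap; the exact figure varies by version and is an assumption here:

```python
# Back-of-the-envelope estimate of namenode heap used by file metadata,
# assuming ~150 bytes of heap per namespace object (file or block).
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not an exact measurement

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Approximate heap consumed: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 1 million small files, each fitting in a single block:
mb = namenode_heap_bytes(1_000_000) / (1024 ** 2)
print(f"~{mb:.0f} MB of namenode heap")  # ~286 MB just for metadata
```

The same 1 million files packed into a few thousand large files would need only a tiny fraction of that heap, which is why file count, not data volume, is what stresses the namenode.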


@Himani Bansal

Can you review the answers to your question and accept the most appropriate one, so this thread can be closed?
