How can we resolve the small file problem in HDFS?
From Hadoop's perspective, a small file is one considerably smaller than the HDFS block size (64 MB or 128 MB). Since Hadoop is built to process huge amounts of data, storing that data as small files means the number of files is necessarily large. Hadoop is actually designed for the opposite: a small number of large files.
Following are the issues with small files:
1. Each file, directory, and block in HDFS is represented as an object in the namenode's memory (i.e. metadata), and each object occupies approximately 150 bytes. Scaling namenode memory to hold an object for every small file is not feasible. In short, as the number of files grows, so does the memory required to store their metadata.
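To make point 1 concrete, here is a rough back-of-the-envelope estimate in Python. The 150-bytes-per-object figure comes from above; the `namenode_memory` function, the 128 MB block size, and the file sizes are illustrative assumptions, not measurements of a real namenode:

```python
# Back-of-the-envelope namenode memory estimate (illustrative numbers).
# Assumption: ~150 bytes of namenode heap per file object and per block object.

BYTES_PER_OBJECT = 150          # approx. metadata cost per HDFS object
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size

def namenode_memory(total_bytes, file_size):
    """Rough metadata footprint for storing `total_bytes` as files of `file_size`."""
    num_files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    # one object per file plus one per block (directories ignored for simplicity)
    num_objects = num_files * (1 + blocks_per_file)
    return num_objects * BYTES_PER_OBJECT

ONE_TB = 1024 ** 4
small = namenode_memory(ONE_TB, 1 * 1024 * 1024)     # 1 TB as 1 MB files
large = namenode_memory(ONE_TB, 1024 * 1024 * 1024)  # 1 TB as 1 GB files
print(f"1 TB as 1 MB files : {small / 1024**2:.0f} MB of namenode heap")
print(f"1 TB as 1 GB files : {large / 1024**2:.1f} MB of namenode heap")
```

Under these assumptions the same terabyte costs roughly 300 MB of namenode heap when stored as 1 MB files, versus about 1.3 MB when stored as 1 GB files, which is the scaling problem described above.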
2. HDFS is not designed for efficient access to small files. Reading a large number of small files causes many seeks and much hopping from datanode to datanode, which is an inefficient data access pattern.
3. A map task usually takes one block of input at a time. If files are much smaller than the block size, the number of map tasks increases and each task processes very little input. This creates a long queue of tasks with high scheduling and start-up overhead, which decreases the overall speed and efficiency of MapReduce jobs.
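The effect on task counts can be sketched the same way. The `map_tasks` helper below is hypothetical and the figures are illustrative, assuming the common default of one input split per file when a file is smaller than a block:

```python
# Illustrative sketch: each file smaller than a block becomes (at least) one
# map task under the default input format; larger files split per block.

BLOCK_SIZE_MB = 128

def map_tasks(total_mb, file_size_mb):
    """Rough map-task count for `total_mb` of input stored as files of `file_size_mb`."""
    num_files = total_mb // file_size_mb
    # a file spanning several blocks yields one task per block;
    # a file smaller than one block still yields a whole task
    splits_per_file = max(1, -(-file_size_mb // BLOCK_SIZE_MB))  # ceiling division
    return num_files * splits_per_file

print(map_tasks(10_240, 1))    # 10 GB as 1 MB files  -> 10240 tasks
print(map_tasks(10_240, 128))  # 10 GB as 128 MB files -> 80 tasks
```

Ten thousand tasks that each read one megabyte spend most of their time on start-up and scheduling rather than useful work, which is the overhead point 3 describes.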
Following are the ways to resolve the small file problem:
1. Hadoop Archive files (HAR): The `hadoop archive` command creates a HAR file by running a MapReduce job that packs many small files into a small number of larger HDFS files. The archived files remain individually accessible through the `har://` filesystem. HAR thus keeps file sizes large and the file count low.
2. Sequence files: With this approach, data is stored as key-value pairs in which the file name is the key and the file contents are the value. A MapReduce program can be written to pack a large number of small files into a single sequence file. MapReduce can then divide the sequence file into parts and process each part independently.
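The key-value packing idea can be sketched outside Hadoop. This toy `pack`/`unpack` pair is an illustrative stand-in only; it mimics the filename-as-key, contents-as-value pattern but does not produce the real Hadoop SequenceFile binary format (which uses `org.apache.hadoop.io` writables in Java):

```python
# Toy sketch of the sequence-file idea: pack many small files into one
# container keyed by file name. Mimics the pattern, NOT the real
# Hadoop SequenceFile on-disk format.
import io
import struct

def pack(files):
    """files: dict of {name: bytes}. Returns one packed blob."""
    buf = io.BytesIO()
    for name, data in files.items():
        key = name.encode("utf-8")
        # length-prefixed key and value records, analogous to key/value pairs
        buf.write(struct.pack(">I", len(key)) + key)
        buf.write(struct.pack(">I", len(data)) + data)
    return buf.getvalue()

def unpack(blob):
    """Inverse of pack: recover {name: bytes} from the blob."""
    files, view, pos = {}, memoryview(blob), 0
    while pos < len(view):
        klen = struct.unpack_from(">I", view, pos)[0]; pos += 4
        key = bytes(view[pos:pos + klen]).decode("utf-8"); pos += klen
        vlen = struct.unpack_from(">I", view, pos)[0]; pos += 4
        files[key] = bytes(view[pos:pos + vlen]); pos += vlen
    return files

small_files = {"a.txt": b"alpha", "b.txt": b"bravo"}
blob = pack(small_files)
assert unpack(blob) == small_files  # round-trip: one blob, many logical files
```

The point of the pattern is that HDFS and the namenode see one large file, while MapReduce can still iterate over the individual logical files as key-value records.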