
Handling small files in Hadoop


Hello Team,

We have a large number of small files in HDFS, and we want to merge them periodically, since too many small files in HDFS can cause performance problems. Looking for advice on how to handle that.

One possible solution is to use Hadoop Archive (HAR), but that would require changing the Hive external table's file location. The situation is that we receive small files every day, and we want to merge all the files for a month into one large file.

Thanks and Regards,



Here's how I would do it:

  • create an EXTERNAL table with 3 partitions, mapped onto 3 directories, e.g. new_data, reorg, and history
  • feed the new files into new_data
  • implement a job to run the batch compaction, and run it periodically
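The three-partition layout above could be wired up in Hive roughly as follows. This is only a sketch: the table name `events`, its columns, and the HDFS paths are placeholders, not from the original post.

```sql
-- External table whose partitions point at the three staging directories.
CREATE EXTERNAL TABLE events (
  id STRING,
  payload STRING
)
PARTITIONED BY (stage STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Map each partition onto its own directory.
ALTER TABLE events ADD PARTITION (stage='new_data') LOCATION '/data/events/new_data';
ALTER TABLE events ADD PARTITION (stage='reorg')    LOCATION '/data/events/reorg';
ALTER TABLE events ADD PARTITION (stage='history')  LOCATION '/data/events/history';
```

Because the table is EXTERNAL and the partitions are just directory mappings, the compaction job can move and merge files underneath without any DDL changes.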

Now the batch compaction logic:

  1. make sure that no SELECT query runs while the compaction is in progress, otherwise it could return duplicates
  2. select all files that are ripe for compaction (define your own criteria, e.g. older than a month in this case) and move them from the new_data directory to reorg
  3. merge the content of all these reorg files into a new file in the history directory (feel free to GZip it on the fly; Hive recognizes the .gz extension)
  4. drop the files in reorg
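The compaction steps above can be sketched in Python. This is a local-filesystem stand-in for illustration only: in production the moves, the merge, and the delete would be `hdfs dfs -mv`, a streaming `hdfs dfs -cat ... | gzip`, and `hdfs dfs -rm` (or the equivalent via a Hadoop client library); the function and directory names are mine, not from the post.

```python
import gzip
import shutil
from pathlib import Path

def compact(base: Path, ripe_files: list[str], archive_name: str) -> Path:
    """Batch-compact the given 'ripe' files from new_data, via reorg,
    into a single gzipped file in history (steps 2-4 above)."""
    new_data = base / "new_data"
    reorg = base / "reorg"
    history = base / "history"
    reorg.mkdir(exist_ok=True)
    history.mkdir(exist_ok=True)

    # Step 2: move ripe files out of new_data first, so files arriving
    # during the run are never mixed into this compaction batch.
    for name in ripe_files:
        shutil.move(str(new_data / name), str(reorg / name))

    # Step 3: concatenate the reorg files into one gzipped file in
    # history; Hive reads .gz text files transparently.
    out = history / f"{archive_name}.gz"
    with gzip.open(out, "wb") as dst:
        for name in sorted(ripe_files):
            with open(reorg / name, "rb") as src:
                shutil.copyfileobj(src, dst)

    # Step 4: drop the now-merged files from reorg.
    for name in ripe_files:
        (reorg / name).unlink()
    return out
```

Step 1 (blocking SELECTs during the run) is deliberately left out here, since how you enforce it depends on your scheduler and access patterns.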