
Handling small files in Hadoop


Contributor

Hello Team,

We have a large number of small files in HDFS, and we want to merge them periodically, since keeping too many small files in HDFS can be a problem. Looking for advice on how to handle this.

One possible solution is to use Hadoop Archive, but that would change the Hive external table's file location. The situation is that we get small files every day, and we want to merge all the files for a month into one large file.

Thanks and Regards,

Rajdip

1 REPLY

Re: Handling small files in Hadoop

Here's how I would do it:

  • create an EXTERNAL table with 3 partitions, mapped onto 3 directories, e.g. new_data, reorg and history
  • feed the new files into new_data
  • implement a job to run the batch compaction, and run it periodically
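The table setup above could look like this in HiveQL. This is only a sketch: the table name, columns and HDFS paths are assumptions for illustration, not part of the original answer.

```sql
-- Hypothetical schema; column list and LOCATION paths are assumptions.
CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (stage STRING)
STORED AS TEXTFILE
LOCATION '/data/events';

-- Map each partition onto its own directory.
ALTER TABLE events ADD PARTITION (stage='new_data') LOCATION '/data/events/new_data';
ALTER TABLE events ADD PARTITION (stage='reorg')    LOCATION '/data/events/reorg';
ALTER TABLE events ADD PARTITION (stage='history')  LOCATION '/data/events/history';
```

Queries then see all three stages as one table, while the compaction job only ever moves files between the underlying directories.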

Now the batch compaction logic:

  1. make sure that no SELECT query is executed while the compaction is running, otherwise it would return duplicates
  2. select all files that are ripe for compaction (define your own criteria, e.g. one month in this case) and move them from the new_data directory to reorg
  3. merge the content of all these reorg files into a new file in the history dir (feel free to GZip it on the fly; Hive will recognize the .gz extension)
  4. drop the files in reorg
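The compaction steps above can be sketched like this. It's a minimal local simulation, assuming plain-text files: pathlib/shutil stand in for the HDFS operations, which in a real job you would replace with hdfs dfs -mv (or the FileSystem API). The directory and file names are made up for the example.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical local directories standing in for the three HDFS partition
# directories from the answer above.
base = Path("compaction_demo")
new_data, reorg, history = (base / d for d in ("new_data", "reorg", "history"))
for d in (new_data, reorg, history):
    d.mkdir(parents=True, exist_ok=True)

# Simulate a few small daily files arriving in new_data.
for day in range(1, 4):
    (new_data / f"2019-07-{day:02d}.txt").write_text(f"records for day {day}\n")

def compact(month: str) -> Path:
    """Compact one month's files: new_data -> reorg -> one .gz in history."""
    # Step 2: select the ripe files and move them from new_data to reorg.
    ripe = sorted(new_data.glob(f"{month}-*.txt"))
    moved = [Path(shutil.move(str(f), str(reorg / f.name))) for f in ripe]

    # Step 3: merge the reorg files into a single gzipped file in history
    # (the .gz extension is read transparently by Hive).
    merged = history / f"{month}.txt.gz"
    with gzip.open(merged, "wt") as out:
        for f in moved:
            out.write(f.read_text())

    # Step 4: drop the files in reorg.
    for f in moved:
        f.unlink()
    return merged

merged = compact("2019-07")
```

In production, the "ripe" criterion would look at file modification times or date-based file names, and step 1 (blocking SELECTs during the run) still has to be enforced outside this script, e.g. by scheduling the job in a quiet window.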