
Facing small file issue on Hive

Contributor

Hi,

In our existing system, around 4-6 million small files are generated per week. They are generated in different directories, and the file sizes vary (<=7 MB). This is creating a lot of unnecessary clutter and performance issues. Is there a way to resolve this?

7 REPLIES

Contributor

@Vijay Parmar, you can concatenate Hive tables to merge small files together. This can happen while the table is active. The syntax is:

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

See the Hive documentation for details.
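For a concrete illustration (the table and partition names here are hypothetical), concatenating one partition looks like this - note that CONCATENATE only works on tables stored as RCFile or ORC:

-- Hypothetical table/partition; requires an RCFile- or ORC-backed table
ALTER TABLE web_logs PARTITION (event_date = '2018-05-01') CONCATENATE;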

Contributor

@Steven O'Neill Thank you for the suggestion.

As of now, the other solution suggested by our DBAs is to create temporary tables, flatten the files and load them into these tables throughout the day, back them up into different tables at the end of the day, and then delete the temporary tables.

That doesn't seem like a good solution so much as a workaround. I'd appreciate it if some insight could be shared on this.

Contributor

@Vijay Parmar, I'd suggest a few tests of concatenation vs. the temporary-table solution suggested by your DBAs vs. whatever else you come up with. Once you get a feel for how the processing works, you'll arrive at the solution that works best for you.


Hi @Vijay Parmar

Apart from the CONCATENATE option in Hive that @Steven O'Neill mentioned, try the options below:

The first property to set differs depending on the execution engine (hive.merge.tezfiles for Tez, hive.merge.mapfiles/hive.merge.mapredfiles for MapReduce). You can also adjust the target file size in the options below. These options merge small files at write time based on the input data; they won't fix files already in the target table unless you rewrite it, but they will prevent new small files from being created going forward.

set hive.merge.tezfiles=true; -- merge small files at the end of a Tez DAG
set hive.merge.mapfiles=true; -- merge at the end of a map-only job (MapReduce)
set hive.merge.mapredfiles=true; -- merge at the end of a map-reduce job (MapReduce)
set hive.merge.smallfiles.avgsize=128000000; -- trigger a merge when avg output file size is below 128MB
set hive.merge.size.per.task=128000000; -- target size of merged files: 128MB
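As a rough usage sketch (the events/events_compacted table names are made up): with the merge settings above in effect, rewriting a table consolidates its files in one pass:

-- Hypothetical tables; run with the hive.merge.* settings above in effect.
-- The merge step runs on the job output, so the copy lands in ~128MB files.
INSERT OVERWRITE TABLE events_compacted SELECT * FROM events;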

Contributor

@Bala Vignesh N V Along with the suggested properties/configurations, there are other properties/configurations being set to resolve the issue. I really appreciate your advice.

New Contributor

This is awesome!! Thanks for sharing.

New Contributor

It has been suggested to me that CONCATENATE is not necessarily reliable; however, I didn't see issues in my testing.

If your table is partitioned and only "recent" partitions are updated you can "reprocess" these partitions periodically to consolidate files:

  1. create temp table
  2. INSERT OVERWRITE TABLE temp_table SELECT ... WHERE date='whatever'
  3. swap partitions (DROP, RENAME, or EXCHANGE as you like to move the new partition into the live table)

We take this approach in some cases (reprocessing recent partitions); a sketch follows below.
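A minimal sketch of that flow, with hypothetical table, column, and partition names (here using EXCHANGE PARTITION for the swap):

-- 1. Temp table with the same schema/layout as the live table
CREATE TABLE web_logs_tmp LIKE web_logs;

-- 2. Rewrite one day's data into the temp table; the rewrite consolidates files
INSERT OVERWRITE TABLE web_logs_tmp PARTITION (event_date = '2018-05-01')
SELECT col1, col2 FROM web_logs WHERE event_date = '2018-05-01';

-- 3. Swap: drop the old partition, then move the rebuilt one into the live table
ALTER TABLE web_logs DROP PARTITION (event_date = '2018-05-01');
ALTER TABLE web_logs EXCHANGE PARTITION (event_date = '2018-05-01') WITH TABLE web_logs_tmp;
DROP TABLE web_logs_tmp;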

An alternative might be ACID tables, which support automatic compaction - I have no experience with this option at all.
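If you do go the ACID route, compaction can also be requested manually; a hypothetical example (assuming transactional tables are enabled on the cluster):

-- Hypothetical table; requires a transactional (ACID) table
ALTER TABLE events_acid PARTITION (event_date = '2018-05-01') COMPACT 'major';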