In our existing system, around 4-6 million small files are generated each week. They are generated in different directories, and the file sizes vary (<=7 MB). This is creating a lot of unnecessary clutter and performance issues. Is there a way to resolve this?
@Steven O'Neill Thank you for the suggestion.
As of now, the other solution suggested by the DBAs is to create temporary tables, flatten the files and load them into these tables throughout the day, back the data up into different tables at the end of the day, and then delete the temporary tables.
This seems more like a workaround than a proper solution. I'd appreciate any insight that can be shared on this.
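For what it's worth, a minimal sketch of what that workaround might look like in HiveQL (the table names `staging_events` and `events_daily` and the input path are illustrative, not from your system):

```sql
-- Hypothetical staging table, loaded throughout the day
CREATE TABLE staging_events (line STRING) STORED AS TEXTFILE;

-- Load the small files as they arrive
LOAD DATA INPATH '/data/incoming/batch_001' INTO TABLE staging_events;

-- End of day: consolidate into the permanent table, then drop the staging table
INSERT INTO TABLE events_daily SELECT * FROM staging_events;
DROP TABLE staging_events;
```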
@Vijay Parmar, I'd suggest a few tests: concatenation vs. the temporary-table solution suggested by your DBAs vs. whatever else you come up with. Once you get a feel for how the processing works, you'll arrive at the solution that works best for you.
Apart from the concatenate option in Hive, which @Steven O'Neill mentioned, try the options below:
The first property to set differs depending on the execution engine (hive.merge.tezfiles for Tez, hive.merge.mapredfiles for MapReduce). You can also adjust the target file size in the options below. These options merge the small files based on the input data; they will rewrite the existing data in the target table, but they will also prevent the problem from recurring if small files keep being created.
set hive.merge.tezfiles=true;                 -- notify Hive that a merge step is required
set hive.merge.smallfiles.avgsize=128000000;  -- 128 MB
set hive.merge.size.per.task=128000000;       -- 128 MB
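With these settings in place, any query that rewrites the table triggers the merge step. A sketch of one way to consolidate an existing table in place (the table name `events` is illustrative):

```sql
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;

-- Rewriting the table with merging enabled consolidates the small files
-- into files of roughly the configured size
INSERT OVERWRITE TABLE events SELECT * FROM events;
```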
It has been suggested to me that CONCATENATE is not necessarily reliable; I didn't see any issues in my testing, however.
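For reference, a concatenation run looks like this (table and partition names are illustrative; CONCATENATE applies to ORC and RCFile tables):

```sql
-- Merge the small files of one partition into fewer, larger files
ALTER TABLE events PARTITION (dt='2017-06-01') CONCATENATE;

-- For an unpartitioned table
ALTER TABLE events CONCATENATE;
```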
If your table is partitioned and only "recent" partitions are updated you can "reprocess" these partitions periodically to consolidate files:
We take this approach for some cases (reprocess recent partitions).
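A hedged sketch of what reprocessing a recent partition might look like (table name, partition column, and the 7-day window are all illustrative assumptions):

```sql
-- Enable dynamic partitioning for the rewrite
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite only the recent partitions; combined with the merge settings
-- above, this leaves each partition with a few large files
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT * FROM events
WHERE dt >= date_sub(current_date, 7);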
An alternative might be ACID tables which support compaction automatically - I have no experience with this option at all.
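I haven't used this either, but for completeness, a sketch of what an ACID table looks like (names and bucket count are illustrative); Hive's background compactor merges the delta files automatically, and compaction can also be requested by hand:

```sql
-- Transactional (ACID) table; requires ORC and bucketing in Hive 1.x/2.x
CREATE TABLE events_acid (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Request a compaction manually instead of waiting for the background process
ALTER TABLE events_acid COMPACT 'major';
```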