Created 09-09-2017 06:58 AM
Hi,
In our existing system, around 4-6 million small files are generated each week. They are generated in different directories, and the file sizes vary (<= 7 MB). This creates a lot of unnecessary clutter and causes performance issues. Is there a way to resolve this?
Created 09-10-2017 11:26 PM
@Vijay Parmar, you can concatenate Hive tables to merge small files together. This can happen while the table is active. The syntax is:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
See the Hive documentation for details.
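As a concrete illustration, here is a minimal sketch of that statement. The table names `events` and `events_unpartitioned`, the partition column `dt`, and the date value are placeholders, not names from this thread. Note that CONCATENATE applies to tables stored as RCFile or ORC.

```sql
-- Hypothetical table and partition names, for illustration only.
-- Merge the small files of a single partition:
ALTER TABLE events PARTITION (dt = '2017-09-09') CONCATENATE;

-- For an unpartitioned table (hypothetical name), omit the PARTITION clause:
ALTER TABLE events_unpartitioned CONCATENATE;
```

Running it on one partition first and comparing the file count before and after is a cheap way to check the effect.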
Created 09-11-2017 03:11 AM
@Steven O'Neill Thank you for the suggestion.
As of now, the other solution suggested by our DBAs is to create temporary tables, flatten the files and load them into these tables throughout the day, back them up into different tables at the end of the day, and then delete the temporary tables.
This doesn't seem to be a good solution, more of a workaround. I'd appreciate any insight on this.
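For readers who want to picture that workflow, here is a rough sketch. Every table name, column, path, and date is hypothetical, and `events_daily` is assumed to already exist as a table partitioned by `dt`. It is one possible reading of the DBAs' suggestion, not their actual procedure.

```sql
-- All names and paths below are hypothetical.
-- 1. Staging table that receives the flattened files during the day.
CREATE TABLE events_staging (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- 2. Load each incoming batch (this moves the files under the table directory).
LOAD DATA INPATH '/data/incoming/batch_0001' INTO TABLE events_staging;

-- 3. End of day: copy everything into the long-lived table.
INSERT INTO TABLE events_daily PARTITION (dt = '2017-09-11')
SELECT id, payload FROM events_staging;

-- 4. Remove the staging table.
DROP TABLE events_staging;
```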
Created 09-12-2017 01:32 AM
@Vijay Parmar, I'd suggest running a few tests comparing concatenation, the temporary table solution suggested by your DBAs, and whatever else you come up with. Once you get a feel for how the processing works, you'll arrive at the solution that works best for you.
Created 09-12-2017 09:47 AM
Apart from the CONCATENATE option in Hive, which was mentioned by @Steven O'Neill, try the options below:
The first property depends on the execution engine (hive.merge.tezfiles applies to Tez). You can also change the file sizes in the settings below. With these options, Hive merges small files as a job writes its output. This will not alter data that already exists in the target table, but it will address the problem for small files created in the future.
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
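For comparison, here is a sketch of the corresponding switches when the execution engine is MapReduce rather than Tez. The size values are simply carried over from the settings above and are not tuned recommendations.

```sql
-- Assumes hive.execution.engine=mr
set hive.merge.mapfiles=true;     -- merge small files at the end of a map-only job
set hive.merge.mapredfiles=true;  -- merge small files at the end of a map-reduce job
set hive.merge.smallfiles.avgsize=128000000; -- merge when the average output file size is below this
set hive.merge.size.per.task=128000000;      -- target size of the merged files
```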
Created 09-19-2017 02:43 AM
@Bala Vignesh N V Along with the suggested properties/configurations, there are other properties/configurations being set to resolve the issue. I really appreciate your advice.
Created 01-31-2018 07:24 PM
This is awesome!! Thanks for sharing.
Created 10-03-2017 08:01 PM
It has been suggested to me that CONCATENATE is not necessarily reliable; I didn't see issues in my testing, however.
If your table is partitioned and only "recent" partitions are updated, you can "reprocess" these partitions periodically to consolidate files.
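The statement originally posted with this reply is not available. One common way to reprocess a partition, offered only as a sketch, is to overwrite it with its own contents so that the job rewrites the data as fewer, larger files. The table `events`, the columns `id` and `payload`, the partition column `dt`, and the date are all hypothetical.

```sql
-- Hypothetical table, columns and partition value.
-- Enable the merge step so the rewrite produces fewer files (Tez engine assumed).
set hive.merge.tezfiles=true;

-- Rewrite one recent partition from its own data.
-- The partition column (dt) is left out of the SELECT list because
-- it is supplied by the static PARTITION clause.
INSERT OVERWRITE TABLE events PARTITION (dt = '2017-09-09')
SELECT id, payload
FROM events
WHERE dt = '2017-09-09';
```

Because the statement replaces the partition's files, it is worth trying on a copy of the table before scheduling it against production data.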
We take this approach for some cases (reprocess recent partitions).
An alternative might be ACID tables, which support automatic compaction - I have no experience with this option at all.
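For reference, compaction on a transactional table can also be requested manually. The sketch below assumes a hypothetical ACID table `events_acid` partitioned by `dt`, and that the transaction manager and compactor are enabled on the cluster. It has not been validated against the setup discussed in this thread.

```sql
-- Hypothetical table name; assumes the table was created with
-- TBLPROPERTIES ('transactional'='true').
ALTER TABLE events_acid PARTITION (dt = '2017-09-09') COMPACT 'major';

-- List compaction requests and their current state.
SHOW COMPACTIONS;
```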