In our existing system, around 4-6 million small files are generated each week. They are generated in different directories, and the file sizes vary (<=7 MB). This is creating a lot of unnecessary clutter and performance issues. Is there a way to resolve this?
@Steven O'Neill Thank you for the suggestion.
As of now, the other solution suggested by the DBAs is to create temporary tables, flatten the files and load them into these tables throughout the day, back the data up into different tables at the end of the day, and then delete the temporary tables.
This seems more like a workaround than a proper solution. I'd appreciate any insight that can be shared on this.
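For what it's worth, a minimal sketch of what that workaround might look like in HiveQL (the table names `staging_events` and `events_daily` and the input path are illustrative, not from your system):

```sql
-- Hypothetical staging table, loaded throughout the day
CREATE TABLE staging_events (line STRING) STORED AS TEXTFILE;

-- Load the small files as they arrive
LOAD DATA INPATH '/data/incoming/batch_001' INTO TABLE staging_events;

-- End of day: consolidate into the permanent table, then drop the staging table
INSERT INTO TABLE events_daily SELECT * FROM staging_events;
DROP TABLE staging_events;
```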
@Vijay Parmar, I'd suggest a few tests: concatenation vs. the temporary-table solution suggested by your DBAs vs. whatever else you come up with. Once you get a feel for how the processing works, you'll arrive at the solution that works best for you.
Apart from the concatenate option in Hive, which @Steven O'Neill mentioned, try the options below:
The first property to set differs depending on the execution engine (hive.merge.tezfiles for Tez, hive.merge.mapredfiles for MapReduce). You can also adjust the target file size in the options below. These options merge the small files based on the input data; they will rewrite the existing data in the target table, but they will also prevent the problem from recurring if small files keep being created.
set hive.merge.tezfiles=true;                 -- notify Hive that a merge step is required
set hive.merge.smallfiles.avgsize=128000000;  -- 128 MB
set hive.merge.size.per.task=128000000;       -- 128 MB
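With these settings in place, any query that rewrites the table triggers the merge step. A sketch of one way to consolidate an existing table in place (the table name `events` is illustrative):

```sql
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;

-- Rewriting the table with merging enabled consolidates the small files
-- into files of roughly the configured size
INSERT OVERWRITE TABLE events SELECT * FROM events;
```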
It has been suggested to me that CONCATENATE is not necessarily reliable; I didn't see any issues in my testing, however.
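For reference, a concatenation run looks like this (table and partition names are illustrative; CONCATENATE applies to ORC and RCFile tables):

```sql
-- Merge the small files of one partition into fewer, larger files
ALTER TABLE events PARTITION (dt='2017-06-01') CONCATENATE;

-- For an unpartitioned table
ALTER TABLE events CONCATENATE;
```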
If your table is partitioned and only "recent" partitions are updated you can "reprocess" these partitions periodically to consolidate files:
We take this approach for some cases (reprocess recent partitions).
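A hedged sketch of what reprocessing a recent partition might look like (table name, partition column, and the 7-day window are all illustrative assumptions):

```sql
-- Enable dynamic partitioning for the rewrite
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite only the recent partitions; combined with the merge settings
-- above, this leaves each partition with a few large files
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT * FROM events
WHERE dt >= date_sub(current_date, 7);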
An alternative might be ACID tables which support compaction automatically - I have no experience with this option at all.
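I haven't used this either, but for completeness, a sketch of what an ACID table looks like (names and bucket count are illustrative); Hive's background compactor merges the delta files automatically, and compaction can also be requested by hand:

```sql
-- Transactional (ACID) table; requires ORC and bucketing in Hive 1.x/2.x
CREATE TABLE events_acid (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Request a compaction manually instead of waiting for the background process
ALTER TABLE events_acid COMPACT 'major';
```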