We have a large number of small files in HDFS, and we want to merge them periodically because too many small files can cause problems in HDFS. I am looking for advice on how to handle this.
One possible solution is to use a Hadoop Archive (HAR), but that would mean changing the file location of the Hive external table. We receive new small files every day, and we want to merge all the files for a month into one large file.
Thanks and Regards,
Here's how I would do it:
Now the batch compaction logic:
reorgfiles, into a new file in
historydir (feel free to GZip it on the fly, Hive will recognize the