Created 01-03-2018 07:59 PM
Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12").
Suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12", "day=2017-12-13", etc.).
The MSCK REPAIR TABLE command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.
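For reference, the repair step looks like this (the table name is illustrative):

```sql
-- Scans the table's storage location (here, S3) and registers any
-- partitions that exist on disk but are missing from the metastore.
MSCK REPAIR TABLE my_events;
```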
But suppose these chunks are mostly very small: is there a simple command to consolidate them, so that instead of 100 small files under a partition I get 2 well-sized ones?
I recognize that I can create a copy of the table and accomplish this, but that seems pretty clumsy.
Is there some kind of HDFS command to "defrag" the data?
FWIW, I'm using EMR with data in S3.
Thanks in advance.
Created 01-04-2018 01:37 PM
You can create a separate staging table holding only the current day's partition, and merge/consolidate the small files into it.
Then run an exchange partition into the main table.
That way you never have to touch the full data set in the main table, and the swap into place is clean.
Put the commands in a shell script with the necessary boundary checks (end-of-day cutoff, reprocessing, etc.) and you will have an airtight solution.
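The steps above can be sketched in HiveQL roughly as follows. All table, partition, and column names are illustrative, and the merge settings shown are standard Hive properties that cause the rewrite to produce a few large output files instead of many small ones:

```sql
-- Staging table with the same schema and partitioning as the main table.
CREATE TABLE events_staging LIKE events;

-- Encourage Hive to merge small output files during the rewrite.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;  -- target ~128 MB files

-- Rewrite the day's data into the staging table; this is where the
-- many small input files get compacted into a few well-sized ones.
INSERT OVERWRITE TABLE events_staging PARTITION (day='2017-12-12')
SELECT c1, c2  -- all non-partition columns, in table order
FROM events
WHERE day='2017-12-12';

-- EXCHANGE PARTITION requires the destination partition to be absent,
-- so drop the small-file version first, then swap in the compacted one.
ALTER TABLE events DROP PARTITION (day='2017-12-12');
ALTER TABLE events EXCHANGE PARTITION (day='2017-12-12')
  WITH TABLE events_staging;
```

The exchange is a metastore-level move of the partition from the staging table into the main table, so the main table's other partitions are never rewritten.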
Thanks,
Raj
Created 01-08-2018 02:51 PM
Thanks for the feedback