Recently migrated a +100gb MySQL table to Impala utilizing partitioning and parquet for performance. Queriess run exceptionally well and everything is working as expected against the historical data, however ongoing daily incremental updates raise some questions in terms of how to approach this.
Overall 160-250mb per day is added to the table (~5-6gb/mo). The partition is set on a vendor_id level, but when we look at the incremental updates on a per vendor basis, it may only only 50k-500k that gets added to a specific vendor on any given day. Not a lot per day, but over time it adds up...
An initial curiosity/test was conducted whereby daily incremental data was loaded into a separate table, and then on the partitioned table we ran an INSERT INTO pulling the data from the incremental table into the partitioned table to see what happened on the file level. A new file was added to the partition for a specific vendor but the file size is extremely small, and not appended to existing data files (expected). If we were to run this every day for a year, we would end up with 365 very small files which is not really ideal, right?
How should incremental updates be approached in this situation? If we were using straight CSV files we could run an ETL that just appendeds the new data to a specific file within the partiion sub-folder (if I understand things correctly), but considering the files are in parquet I don't see how this is possible... Does the incremental table which gets INSERTed INTO the main table a good approach? Is there some sort of cleanup which can be run regularily which will basically re-structure(?) things so that the files themselves are re-created and the smaller ones go away and you're left with just fewer larger files?
At a bit of a loss on what makes the most sense and any guidance is greatly appreciated!
Thanks Jeremy! After some further digging into specifics with some additional Cloudera contacts, the solution I am seeing is this:
Need to do some further diggin into additional performance improvements and continue the testing, but I think we definitely have something to move forward with.