Support Questions
Find answers, ask questions, and share your expertise

Incremental Updates To Impala Table w/ Partition & Parquet

Recently migrated a 100+ GB MySQL table to Impala, using partitioning and Parquet for performance. Queries run exceptionally well and everything is working as expected against the historical data; however, ongoing daily incremental updates raise some questions about how to approach this.

Overall, 160-250 MB per day is added to the table (~5-6 GB/mo). The table is partitioned by vendor_id, but when we look at the incremental updates on a per-vendor basis, it may be only 50k-500k rows that get added to a specific vendor on any given day. Not a lot per day, but over time it adds up...

An initial test was conducted whereby daily incremental data was loaded into a separate table, and then we ran an INSERT INTO on the partitioned table, pulling the data from the incremental table, to see what happened at the file level. A new file was added to the partition for a specific vendor, but the file is extremely small and is not appended to the existing data files (expected). If we were to run this every day for a year, we would end up with 365 very small files, which is not really ideal, right?
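For reference, the test described above was along these lines (table and column names here are hypothetical placeholders, not the actual schema):

```sql
-- Staging table with the same layout as the partitioned Parquet table,
-- holding one day's incremental extract.
CREATE TABLE sales_incremental LIKE sales_partitioned;

-- Copy the day's rows in. With dynamic partitioning, the partition column
-- goes last in the SELECT list; Impala routes each row to its vendor_id
-- partition and writes a new (small) Parquet file there.
INSERT INTO sales_partitioned PARTITION (vendor_id)
SELECT col_a, col_b, vendor_id FROM sales_incremental;
```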

How should incremental updates be approached in this situation? If we were using straight CSV files, we could run an ETL that just appends the new data to a specific file within the partition sub-folder (if I understand things correctly), but considering the files are in Parquet, I don't see how this is possible... Is the incremental table which gets INSERTed INTO the main table a good approach? Is there some sort of cleanup which can be run regularly that would basically restructure things, so that the files themselves are re-created, the smaller ones go away, and you're left with fewer, larger files?

At a bit of a loss on what makes the most sense and any guidance is greatly appreciated!






Re: Incremental Updates To Impala Table w/ Partition & Parquet

This is often achieved with a daily merge process that joins the base and incremental data and creates a new partitioned table to replace the entire existing base table.

For extremely large tables this might not be practical, but for 100 GB tables this is often not too long of a process. If you come from an RDBMS background, you may be surprised at how fast Hadoop can do this. The merge logic is usually expressible with Impala SQL alone.
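A minimal sketch of such a daily merge, assuming a unique row key (row_id here, plus the table and column names, are hypothetical): keep every incremental row, plus every base row that was not superseded by the increment, and write the result into a fresh partitioned table.

```sql
-- Rebuild the base data into a new partitioned Parquet table from the
-- union of the day's increment and the untouched base rows.
INSERT OVERWRITE sales_merged PARTITION (vendor_id)
SELECT col_a, col_b, vendor_id FROM (
  -- All new/updated rows from today's extract
  SELECT col_a, col_b, vendor_id FROM sales_incremental
  UNION ALL
  -- Base rows that do NOT appear in the increment (anti-join on the key)
  SELECT b.col_a, b.col_b, b.vendor_id
  FROM sales_base b
  LEFT ANTI JOIN sales_incremental i ON b.row_id = i.row_id
) merged;
```

After the merge completes, the new table can be swapped in for the old one (e.g. by renaming tables or repointing a view).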

In the future a better solution may be to use the Kudu storage layer. That will allow in-place incremental updates while maintaining fast analytics performance for your end user queries. For now, though, Kudu is in beta and not production ready.

Re: Incremental Updates To Impala Table w/ Partition & Parquet


Thanks Jeremy! After some further digging into specifics with some additional Cloudera contacts, the solution I am seeing is this:

  • incremental data from MySQL gets loaded into a temporary table (table_tmp) each day (straight up overwrite the CSV on HDFS)
  • then each day we use an INSERT INTO table_a from table_tmp, where table_a is set up using Parquet and dynamic partitioning on the field we specify
  • on a daily, weekly, or monthly schedule (likely monthly, if performance doesn't decrease too much with 30-ish small files in each partition) we run an INSERT OVERWRITE table_b from table_a. table_b is set up exactly the same as table_a: partitioned + Parquet
  • we also set up a view and alter it each month to point to the current production table (as per
  • Turns out you can do an INSERT OVERWRITE on the same table, so on a regular schedule we would run that, which will get everything back to a few big files. I just ran it on the test table we're working with and it completed in ~10 min, which is great
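The compaction and view steps above can be sketched as follows (column names and the view name are hypothetical; table_a/table_b are from the list above):

```sql
-- Periodic compaction: overwriting a table with a SELECT from itself
-- rewrites each partition, coalescing the many small daily Parquet files
-- into a few larger ones.
INSERT OVERWRITE table_a PARTITION (vendor_id)
SELECT col_a, col_b, vendor_id FROM table_a;

-- If maintaining parallel tables instead, repoint the production view
-- at whichever table is current.
ALTER VIEW sales_current AS SELECT * FROM table_b;
```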


Need to do some further digging into additional performance improvements and continue the testing, but I think we definitely have something to move forward with.