I have a partitioned table (using parquet) with full stats computed.
When i do the following:
- add a new timestamp column
- add a few new partitions
- issue a compute incremental stats (without stating which partitions to compute)
i assumed only the new partitions are scanned and the new column for every old partition.
What i see is that Impala is recomputing the full stats for the complete table and all columns.
The table contains almost 300 billion rows so this will take a very long time.
say the following:
If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS.
i assumed that "appropriate column-level values" means only the new column values, am i reading this wrong?
Or is this not the correct Impala behaviour that i am seeing?
Thanks for any help/insight into this.
EDIT 2 hours later:
- I remember i also removed some 150 partitions before i altered the table and added the column. And after this
the total number of rows in the table (as shown by show partitions) was not correct (count(*) from table showed different lower number of rows)
What i did now is:
- cancel the update incremental stats query. is was <2% complete after almost 4 hours.
- added the stats for the new column manually with alter table t set column stats col_ts ('numDVs'='0', 'numNulls'='0');
This did not have the desired afffect a new update incremental stats query again started analyzing the entire table.
- Then i calculated the stats for a single new partition by adding the PARTITION(...) clause to the update incremental stats query. (which i should not have to do)
- After this single partition was updated i ran the update incremental stats command again WITHOUT the PARTITION clause and now it only analyzed the new partitions and NOT the entire table!
The total number of rows in the table shown by "show partitions" is now correct.