Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala compute incremental stats not working as expected


Impala compute incremental stats not working as expected



I have a partitioned table (using parquet) with full stats computed.


When i do the following:

 - add a new timestamp column

 - add a few new partitions 

- issue a  compute incremental stats (without stating which partitions to compute)


i assumed only the new partitions are scanned and the new column for every old partition.


What i see is that Impala is recomputing the full stats for the complete table and all columns.

The table contains almost 300 billion rows so this will take a very long time.



The docs:

say the following: 


If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS.


i assumed that "appropriate column-level values" means only the new column values, am i reading this wrong?


Or is this not the correct Impala behaviour that i am seeing?


Thanks for any help/insight into this.


EDIT 2 hours later:


- I remember i also removed some 150 partitions before i altered the table and added the column. And after this

the total number of rows in the table (as shown by show partitions) was not correct (count(*) from table showed different lower number of rows)


What i did now is:

 - cancel the update incremental stats query. is was <2% complete after almost 4 hours.

- added the stats for the new column manually with alter table t set column stats col_ts ('numDVs'='0', 'numNulls'='0');


This did not have the desired afffect a new  update incremental stats query again started analyzing the entire table.


- Then i calculated the stats for a single new partition by adding the PARTITION(...) clause to the update incremental stats query. (which i should not have to do)

- After this single partition was updated i ran the update incremental stats command again WITHOUT the PARTITION clause and now it only analyzed the new partitions and NOT the entire table!


The total number of rows in the table shown by "show partitions" is now correct.




Re: Impala compute incremental stats not working as expected

I had to compute incremental stats using a script which would pass partition values, calculating entire table stats at a time ran out of memory.

Re: Impala compute incremental stats not working as expected

I was having this same issue after adding a column to a table with ~2K partitions. What's perplexing is that Impala would rerun all this stats gathering for a column that doesn't physically exist in those old partitions. Seems like this could be done better, hopefully an Impala developer can comment.

In any case, your solution worked like a charm, and I owe you a beer :)

Don't have an account?
Coming from Hortonworks? Activate your account here