Support Questions

Find answers, ask questions, and share your expertise

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Cloudera Community
- :
- Support
- :
- Support Questions
- :
- Re: Impala compute incremental stats not working a...

Announcements

Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

Impala compute incremental stats not working as expected

Labels:

Contributor

Created on 11-22-2016 12:45 AM - edited 11-22-2016 02:23 AM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I have a partitioned table (using parquet) with full stats computed.

When i do the following:

- add a new timestamp column

- add a few new partitions

- issue a compute incremental stats (without stating which partitions to compute)

i assumed only the new partitions are scanned and the new column for every old partition.

What i see is that Impala is recomputing the full stats for the complete table and all columns.

The table contains almost 300 billion rows so this will take a very long time.

The docs: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html

say the following:

If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS.

i assumed that "appropriate column-level values" means only the new column values, am i reading this wrong?

Or is this not the correct Impala behaviour that i am seeing?

Thanks for any help/insight into this.

EDIT 2 hours later:

- I remember i also removed some 150 partitions before i altered the table and added the column. And after this

the total number of rows in the table (as shown by show partitions) was not correct (count(*) from table showed different lower number of rows)

What i did now is:

- cancel the update incremental stats query. is was <2% complete after almost 4 hours.

- added the stats for the new column manually with alter table t set column stats col_ts ('numDVs'='0', 'numNulls'='0');

This did not have the desired afffect a new update incremental stats query again started analyzing the entire table.

- Then i calculated the stats for a single new partition by adding the PARTITION(...) clause to the update incremental stats query. (which i should not have to do)

- After this single partition was updated i ran the update incremental stats command again WITHOUT the PARTITION clause and now it only analyzed the new partitions and NOT the entire table!

The total number of rows in the table shown by "show partitions" is now correct.

2 REPLIES 2

Highlighted
##

I had to compute incremental stats using a script which would pass partition values, calculating entire table stats at a time ran out of memory.

Re: Impala compute incremental stats not working as expected

Explorer

Created 11-25-2016 05:50 AM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Re: Impala compute incremental stats not working as expected

Explorer

Created 01-19-2017 12:56 PM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

In any case, your solution worked like a charm, and I owe you a beer :)

Thanks!

Coming from Hortonworks? Activate your account here