Why does `SHOW COLUMN STATS <table_name>` not show statistics generated by Hive's `ANALYZE TABLE ...`?

Expert Contributor

 

Hello,

I am researching how to use statistics to optimize queries.
From what I read here, http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_perf_stats.h...
although Impala recommends using `COMPUTE STATS` or `COMPUTE INCREMENTAL STATS` depending on the use case,
I can still generate statistics from Hive by using `ANALYZE TABLE ... COMPUTE STATISTICS` and/or `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`

After I generate statistics with `ANALYZE TABLE ... COMPUTE STATISTICS` in Hive,
I can see the table and partition statistics in Impala by running `show table stats Table1`.

However, I couldn't see any updated column statistics even after `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`.
Of course, I used `SHOW COLUMN STATS Table1`.

Is this a bug or am I missing anything?

Thank you for your help in advance.
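For concreteness, the flow being described looks roughly like this, as a sketch (`Table1` as in the question; exact behavior may vary by Hive/Impala version):

```sql
-- In Hive: generate table/partition stats, then column stats
ANALYZE TABLE Table1 COMPUTE STATISTICS;
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;

-- In impala-shell: pick up the new metadata, then inspect it
INVALIDATE METADATA Table1;
SHOW TABLE STATS Table1;    -- table/partition stats appear
SHOW COLUMN STATS Table1;   -- column stats still show -1
```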


15 REPLIES

Master Collaborator

Hi!

 

Your scenario should work. Did you run `INVALIDATE METADATA <table>` in Impala after computing the stats in Hive?

 

Also, Impala only deals with column stats at the table level, so if you compute the column stats for a specific partition in Hive, then those stats will not show up in Impala.

 

 

Expert Contributor

Thank you for your feedback.

 

Yes, I ran `INVALIDATE METADATA <table_name>;`

As I mentioned in the question, I can see the updated table statistics via `show table stats <table_name>` after executing `ANALYZE TABLE Table1 PARTITION(yearmonth) COMPUTE STATISTICS;`:

 

Query: show table stats products
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+
| yearmonth | #Rows    | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats |
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+
| 201404    | 48799023 | 9      | 866.37MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201405    | 54633812 | 11     | 968.30MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201406    | 49516351 | 11     | 873.53MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201407    | 51782891 | 12     | 934.59MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201408    | 73191206 | 13     | 1.23GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201409    | 76577223 | 17     | 1.31GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201410    | 78462480 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201411    | 78778172 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201412    | 78778304 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201501    | 78914761 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201502    | 79112909 | 17     | 1.36GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201503    | 79270403 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201504    | 79315850 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201505    | 79626491 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201506    | 79644598 | 17     | 1.42GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201507    | 79741074 | 17     | 1.42GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201508    | 79798934 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201509    | 79920969 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201510    | 79950252 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201511    | 79965601 | 18     | 1.44GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201512    | 80038743 | 5      | 1.41GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| Total     | -1       | 320    | 27.03GB  | 0B           |                   |         |                   |
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+

 

 

However, I can't see updated column statistics via `show column stats <table_name>` after executing `ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;`:

Query: show column stats products
+-----------------+--------+------------------+--------+----------+----------+
| Column          | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------------+--------+------------------+--------+----------+----------+
| product_name    | STRING | -1               | -1     | -1       | -1       |
| token_count     | INT    | -1               | -1     | 4        | 4        |
| currency        | STRING | -1               | -1     | -1       | -1       |
| yearmonth       | INT    | 21               | 0      | 4        | 4        |
+-----------------+--------+------------------+--------+----------+----------+

 

Of course, I executed `INVALIDATE METADATA <table_name>` in Impala.

 

Thank you.

 

 

Master Collaborator

Thanks.

 

I can reproduce the problem on a partitioned table (an unpartitioned table works). Give me some time to look into it.

Expert Contributor
Thank you so much 🙂

Master Collaborator

OK, it looks like I was wrong in assuming that Hive would compute the column stats at the table level.

For a partitioned table, Hive's ANALYZE TABLE command computes the column stats on a per-partition basis.

It's not clear that this approach even makes sense: how would one then aggregate the distinct-value stats across partitions? Those aggregated stats could be wildly inaccurate, so maybe this is not a good flow anyway, even if we could make it work.
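A small hypothetical example of why per-partition distinct-value counts cannot simply be added up:

```sql
-- Suppose a column's values are:
--   partition p1: (1, 2, 3)  -> per-partition NDV = 3
--   partition p2: (2, 3, 4)  -> per-partition NDV = 3
-- The true table-level NDV is 4 (values 1, 2, 3, 4), not 3 + 3 = 6,
-- because values overlap across partitions. Correctly merging NDVs
-- would require per-partition sketches, which this Hive flow does not keep.
SELECT COUNT(DISTINCT col) FROM t;  -- the table-level answer Impala needs
```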

 

That's why the stats do not show up in Impala. The flow of computing column stats in Hive and then using them in Impala will currently not work for partitioned tables.

 

Expert Contributor

Thank you for your investigation.

 

I am using `COMPUTE INCREMENTAL STATS` for a table to which I add a new yearmonth partition every month.

However, `COMPUTE INCREMENTAL STATS` takes about 2 hours, and it keeps taking longer and longer.

To me, this is because `COMPUTE INCREMENTAL STATS` does two operations on the given table:

1. counting the total number of rows, and

2. doing a full recomputation of incremental stats if too many partitions are selected.

 

I'm looking for some other ways to reduce time for gathering statistics.

 

This is the log I get during `COMPUTE INCREMENTAL STATS`:

 

+---------------------------------------------+
| summary                                     |
+---------------------------------------------+
| Updated 1372 partition(s) and 29 column(s). |
+---------------------------------------------+
WARNINGS: Too many partitions selected, doing full recomputation of incremental stats
Fetched 1 row(s) in 5871.22s

again, thank you for your help

Master Collaborator

Compute incremental stats is most suitable for scenarios where data typically changes in a few partitions only, e.g., adding partitions or appending to the latest partition, etc.

 

The first time you run COMPUTE INCREMENTAL STATS, it will compute the incremental stats for all partitions. The next time, it will only compute stats for partitions that have changed in the meantime. You can manually DROP INCREMENTAL STATS for a particular partition if you want to force re-computing its stats in the next COMPUTE INCREMENTAL STATS.

 

Alternatively, you can also run COMPUTE INCREMENTAL STATS for specific partitions only, so you can control how much work should be done for computing stats.
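As a sketch (the partition spec is hypothetical; check the Impala docs for the exact syntax in your version):

```sql
-- Force re-computation of stats for one partition on the next run
DROP INCREMENTAL STATS Table1 PARTITION (yearmonth=201512);

-- Or limit the work to that partition directly
COMPUTE INCREMENTAL STATS Table1 PARTITION (yearmonth=201512);
```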

 

Note that incremental stats impose additional memory requirements on all daemons, so be sure to follow the guidelines in the Impala cookbook:

 

http://www.slideshare.net/cloudera/the-impala-cookbook-42530186

 

Expert Contributor

Alex,

 

Thank you for your reply.

I am aware of what you explained in your reply about `COMPUTE INCREMENTAL STATS`.

 

I have a question about the scenarios where data changes in only a few partitions.

What do you mean by "a few partitions"?

 

In my table, there are two partition levels, such as yearmonth/name_prefix.

Every month, a new yearmonth is added, and that new yearmonth contains around 1360 name_prefix partitions.

In this case, can I say these are "a few partitions"?

 

Also, you mentioned that I can run `COMPUTE INCREMENTAL STATS` for specific partitions.

With this approach, I have to specify both partition levels with constant values,

for example (yearmonth=201512, name_prefix=ae).

This means I would need to execute `COMPUTE INCREMENTAL STATS` around 1300 times, since there are about 1300 name_prefix partitions under yearmonth=201512.
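For illustration, the fully-specified form would look like this, repeated once per name_prefix (partition values hypothetical):

```sql
-- Both partition levels must currently be given constant values
COMPUTE INCREMENTAL STATS Table1 PARTITION (yearmonth=201512, name_prefix='ae');
COMPUTE INCREMENTAL STATS Table1 PARTITION (yearmonth=201512, name_prefix='af');
-- ... and so on for each of the ~1300 name_prefix values
```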

 

I have seen an open ticket related to this: https://issues.cloudera.org/browse/IMPALA-1570

Maybe, after this ticket is resolved, I can run `COMPUTE INCREMENTAL STATS` with only the first-level partition, like (yearmonth=201512, name_prefix).

 

Question 1:

Based on what you said, the way I use `COMPUTE INCREMENTAL STATS` doesn't work as intended.

It just does a full recomputation. Am I right?

 

Question 2:

Given my current use case, is it better to use `COMPUTE STATS`?

 

Question 3:

While a `COMPUTE STATS` operation is running, if a user accesses the table, what is the expected behavior in Impala?

 

 

I really appreciate your help 🙂

 

 

 

Master Collaborator

Thanks for the detailed explanation!

 

See my responses below:

 

The term "partition" usually means a complete partition specification, i.e., you can construct a path into HDFS that contains data files (and not folders for the next level partition values).

 

RE: Question 1

That's correct. It will do a full recomputation. However, it might be worth questioning your partitioning strategy. Typically we do not recommend more than 10k partitions per table. In extreme cases you might go up to 100k, but that is really stretching it, and you need to fully understand the consequences.

The 10k recommendation stems from limitations in Impala and the Hive Metastore in scaling up to an extreme number of partitions.

 

RE: Question 2

Probably yes. If you have an extreme number of partitions, then compute incremental stats is not recommended due to the additional memory requirements.

 

RE: Question 3

Doing concurrent reads should be fine. Doing concurrent writes with INSERT INTO should also be fine, as long as you do not OVERWRITE. However, there is no guarantee as to which partitions/files will be seen by COMPUTE STATS. It may see some, but not all, of the results from your INSERT.

 

Expert Contributor
Alex,

Thank you so much for your time and explanation.
All your comments are really valuable to me 🙂

Thank you again.

Expert Contributor

Alex,

 

I have a question.

Like I said, one of my tables (Table1) has 31k partitions.

 

It seems the table performs OK.

However, since you said that more than 10k partitions per table is not recommended, I have been thinking about reducing the number of partitions.

Recently, I read about bucketing in Hive and generated a test table that is partitioned at the first level and bucketed at the second level.

 

BTW, I got this warning from Impala:

WARNINGS: For better performance, snappy, gzip and bzip-compressed files should not be split into multiple hdfs-blocks. file=/user/hive/external_warehouse/test_table/yrmonth=201512/000151_0.snappy offset 134217728

 

 

Here are my questions.

 

Question 1:

In general, how is Impala query performance with Hive bucketing?

 

Question 2:

How can I avoid hitting cached results for an Impala query?

 

I'd like to do some performance testing with this new table.

However, if I run the same query twice, the second run seems to use data cached by the first one.

 

Thank you

Master Collaborator

1. Impala does not support bucketed tables.

 

2. Your first run is likely slow because Impala first needs to load the table metadata from the catalogd (as explained in another thread). After the first reference to the table, the metadata is cached. So when performance testing, you should discard the first run, unless you specifically want to test performance on a cold cache (which would be unusual).

Expert Contributor

Thank you,

 

1. Any plan to support bucketed tables?

 

2. Yes, I remember what you said about catalogd loading metadata.

However, my question was how to avoid hitting an Impala cache, if Impala caches queries that were run before.

The reason I am asking is that I saw some differences in performance.

For example, let's say catalogd already has the metadata for `Table1` because I already warmed up the table by running a `SELECT` query. After this, I run two different queries like this:

 

I run QUERY1 - this takes 15sec

I run QUERY2 - this takes 20sec

I run QUERY1 again - this takes 6sec

I run QUERY2 again - this takes 10sec

 

 

How should I interpret this difference?

Master Collaborator

1. It's on our radar, but we don't have concrete plans yet.

 

2. Impala does not do query plan or result caching, so the differences you see are not due to Impala caching. There could be several reasons for differences between runs, e.g., the OS buffer cache warming up. The first step would be to study the runtime profiles of those queries. The exec summary is often enough to determine where time was saved the second time around.
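In impala-shell, the exec summary and profile of the most recent query can be pulled up right after it finishes (commands per the impala-shell documentation; query is just an example):

```sql
-- Run the query, then inspect where the time went
SELECT COUNT(*) FROM Table1;
SUMMARY;   -- per-operator exec summary of the last query
PROFILE;   -- full runtime profile of the last query
```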

 

Expert Contributor

Ah, I got it.

 

Tomorrow, I will try to compare the difference between the first and second run by checking the exec summary.

 

Many thanks!