Created on 12-30-2015 01:28 PM - edited 09-16-2022 02:55 AM
Hello,
I am researching how to use statistics to optimize queries.
From what I read here: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_perf_stats.h...
Impala recommends using `COMPUTE STATS` or `COMPUTE INCREMENTAL STATS` depending on the use case,
but I can still generate statistics from Hive with `ANALYZE TABLE ... COMPUTE STATISTICS` and/or `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`.
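For reference, this is roughly the sequence I am talking about (table and partition names are just examples):

-- Hive
ANALYZE TABLE Table1 PARTITION (yearmonth) COMPUTE STATISTICS;
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;

-- Impala, afterwards
INVALIDATE METADATA Table1;
SHOW TABLE STATS Table1;
SHOW COLUMN STATS Table1;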
After generating statistics with `ANALYZE TABLE ... COMPUTE STATISTICS` in Hive,
I can see the table and partition statistics in Impala by running `SHOW TABLE STATS Table1`.
However, I can't see any updated column statistics, even after `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`.
Of course, I checked with `SHOW COLUMN STATS Table1`.
Is this a bug or am I missing anything?
Thank you for your help in advance.
Created 12-30-2015 03:39 PM
Hi!
Your scenario should work. Did you do `INVALIDATE METADATA <table>` in Impala after computing the stats in Hive?
Also, Impala only deals with column stats at the table level, so if you compute the column stats for a specific partition in Hive, then those stats will not show up in Impala.
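As a rough sketch of the scenario I mean (hypothetical table my_table partitioned by yearmonth):

-- Hive: column stats scoped to a single partition
ANALYZE TABLE my_table PARTITION (yearmonth=201512) COMPUTE STATISTICS FOR COLUMNS;

-- Impala: refresh the metadata and check the table-level view
INVALIDATE METADATA my_table;
SHOW COLUMN STATS my_table;   -- partition-scoped column stats will not appear here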
Created on 12-30-2015 03:45 PM - edited 12-30-2015 04:03 PM
Thank you for your feedback.
Yes. I did `INVALIDATE METADATA <table_name>`;
As I mentioned in the question, I can see the updated table statistics with `show table stats <table_name>` after executing `ANALYZE TABLE Table1 PARTITION(yearmonth) COMPUTE STATISTICS;`:
Query: show table stats products
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+
| yearmonth | #Rows    | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats |
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+
| 201404    | 48799023 | 9      | 866.37MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201405    | 54633812 | 11     | 968.30MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201406    | 49516351 | 11     | 873.53MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201407    | 51782891 | 12     | 934.59MB | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201408    | 73191206 | 13     | 1.23GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201409    | 76577223 | 17     | 1.31GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201410    | 78462480 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201411    | 78778172 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201412    | 78778304 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201501    | 78914761 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201502    | 79112909 | 17     | 1.36GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201503    | 79270403 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201504    | 79315850 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201505    | 79626491 | 17     | 1.37GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201506    | 79644598 | 17     | 1.42GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201507    | 79741074 | 17     | 1.42GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201508    | 79798934 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201509    | 79920969 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201510    | 79950252 | 18     | 1.43GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201511    | 79965601 | 18     | 1.44GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| 201512    | 80038743 | 5      | 1.41GB   | NOT CACHED   | NOT CACHED        | RC_FILE | false             |
| Total     | -1       | 320    | 27.03GB  | 0B           |                   |         |                   |
+-----------+----------+--------+----------+--------------+-------------------+---------+-------------------+
However, I can't see the updated column statistics with `show column stats <table_name>` after executing `ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;`:
Query: show column stats products
+-----------------+--------+------------------+--------+----------+----------+
| Column          | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------------+--------+------------------+--------+----------+----------+
| product_name    | STRING | -1               | -1     | -1       | -1       |
| token_count     | INT    | -1               | -1     | 4        | 4        |
| currency        | STRING | -1               | -1     | -1       | -1       |
| yearmonth       | INT    | 21               | 0      | 4        | 4        |
+-----------------+--------+------------------+--------+----------+----------+
Of course, I executed `INVALIDATE METADATA <table_name>` in Impala.
Thank you.
Created 12-30-2015 04:09 PM
Thanks.
I can reproduce the problem on a partitioned table (unpartitioned works), give me some time to look into it.
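Roughly the kind of check I am running, with made-up table names (exact behavior may depend on the Hive version):

-- Hive: unpartitioned table -- column stats later show up in Impala
CREATE TABLE t_flat (c1 INT, c2 STRING);
ANALYZE TABLE t_flat COMPUTE STATISTICS FOR COLUMNS;

-- Hive: partitioned table -- same command, but Impala keeps showing -1
CREATE TABLE t_part (c1 INT, c2 STRING) PARTITIONED BY (p INT);
ANALYZE TABLE t_part COMPUTE STATISTICS FOR COLUMNS;

-- Impala: after each ANALYZE
INVALIDATE METADATA t_flat;
SHOW COLUMN STATS t_flat;   -- stats visible
INVALIDATE METADATA t_part;
SHOW COLUMN STATS t_part;   -- #Distinct Values stays at -1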
Created 12-30-2015 04:19 PM
Ok, it looks like I was wrong in assuming that Hive would compute the column stats at the table level.
For a partitioned table, Hive's ANALYZE TABLE command computes the column stats on a per-partition basis.
It's not clear that this approach even makes sense: how would one aggregate the distinct-value stats across partitions? Those aggregated stats would likely be wildly inaccurate, so this may not be a good flow anyway, even if we could make it work.
That's why the stats do not show up in Impala. The flow of computing column stats in Hive and then using them in Impala currently does not work for partitioned tables.
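To make the aggregation problem concrete, here is a tiny hypothetical example:

-- Suppose a column c contains:
--   partition p=1: values {a, b, c}  -> per-partition #Distinct Values = 3
--   partition p=2: values {b, c, d}  -> per-partition #Distinct Values = 3
-- The true table-level count is 4 ({a, b, c, d}), but neither summing (3+3=6)
-- nor taking the max (3) of the per-partition numbers recovers it without
-- rescanning the data, so per-partition distinct counts cannot simply be merged.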
Created on 12-30-2015 04:27 PM - edited 12-30-2015 04:28 PM
Thank you for your investigation.
I am using `COMPUTE INCREMENTAL STATS` on a table to which I add a new yearmonth partition every month.
However, `COMPUTE INCREMENTAL STATS` takes about 2 hours and keeps getting slower.
As far as I can tell, that is because `COMPUTE INCREMENTAL STATS` performs two operations on the table:
1. counting the total number of rows, and
2. doing a full recomputation of the incremental stats when too many partitions are selected.
I'm looking for other ways to reduce the time spent gathering statistics.
This is the log I get during `COMPUTE INCREMENTAL STATS`:
+---------------------------------------------+
| summary                                     |
+---------------------------------------------+
| Updated 1372 partition(s) and 29 column(s). |
+---------------------------------------------+
WARNINGS: Too many partitions selected, doing full recomputation of incremental stats
Fetched 1 row(s) in 5871.22s
Again, thank you for your help.
Created 01-04-2016 04:46 PM
Compute incremental stats is most suitable for scenarios where data typically changes in a few partitions only, e.g., adding partitions or appending to the latest partition, etc.
The first time you do COMPUTE INCREMENTAL STATS, it will compute the incremental stats for all partitions. The next time, it will only compute the stats for partitions that have changed in the meantime. You can manually DROP INCREMENTAL STATS for a particular partition if you want to force re-computing stats for that partition in the next COMPUTE INCREMENTAL STATS.
Alternatively, you can also COMPUTE INCREMENTAL STATS for specific partitions only, so you can control how much work should be done for computing stats.
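For example, something along these lines (table name and partition values are placeholders; adjust to your own partition columns):

-- Impala: force re-computation of a single partition on the next run
DROP INCREMENTAL STATS your_table PARTITION (yearmonth=201511);
COMPUTE INCREMENTAL STATS your_table;

-- or compute incremental stats for one specific partition only
COMPUTE INCREMENTAL STATS your_table PARTITION (yearmonth=201512);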
Note that incremental stats have additional memory requirements on all daemons, so be sure to follow the guidelines in the Impala cookbook:
http://www.slideshare.net/cloudera/the-impala-cookbook-42530186
Created 01-04-2016 05:14 PM
Alex,
Thank you for your reply.
I am aware of what you explained in your reply about `COMPUTE INCREMENTAL STATS`.
I have a question about the scenario where data changes in only a few partitions.
What do you mean by "a few partitions"?
My table has two partition levels, such as yearmonth/name_prefix.
Every month a new yearmonth is added, and the new yearmonth has around 1360 name_prefix partitions.
In this case, can I say there are only a few partitions?
You also mentioned that I can run `COMPUTE INCREMENTAL STATS` for specific partitions.
With this approach, I have to specify both partition levels with constant values,
for example (yearmonth=201512, name_prefix=ae).
This means I would need to execute `COMPUTE INCREMENTAL STATS` around 1300 times, since there are about 1300 name_prefix partitions under yearmonth=201512.
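Roughly what I mean (table name is just a placeholder):

-- Each partition currently has to be fully specified:
COMPUTE INCREMENTAL STATS my_table PARTITION (yearmonth=201512, name_prefix='ae');
COMPUTE INCREMENTAL STATS my_table PARTITION (yearmonth=201512, name_prefix='af');
-- ... and so on, once per name_prefix value (about 1300 statements).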
I have seen an open ticket related to this: https://issues.cloudera.org/browse/IMPALA-1570
Maybe after that ticket is resolved, I will be able to run `COMPUTE INCREMENTAL STATS` with only the first-level partition, like (yearmonth=201512, name_prefix).
Question 1:
Based on what you said, the way I use `COMPUTE INCREMENTAL STATS` doesn't help; it just does a full recomputation. Am I right?
Question 2:
With my current use case, is it better to use `COMPUTE STATS`?
Question 3:
While a `COMPUTE STATS` operation is running, what is the expected behavior in Impala if users access the table?
I really appreciate your help 🙂
Created 01-04-2016 06:58 PM
Thanks for the detailed explanation!
See my responses below:
The term "partition" usually means a complete partition specification, i.e., you can construct a path into HDFS that contains data files (and not folders for the next level partition values).
RE: Question 1
That's correct. It will do a full recomputation. However, it might be worth questioning your partitioning strategy. Typically we do not recommend more than 10k partitions per table. In extreme cases you might go up to 100k, but that is really stretching it, and you need to fully understand the consequences.
The 10k recommendation stems from limitations in Impala and the Hive Metastore in scaling up to an extreme number of partitions.
RE: Question 2
Probably yes. If you have an extreme number of partitions, then compute incremental stats is not recommended due to the additional memory requirements.
RE: Question 3
Doing concurrent reads should be fine. Doing concurrent writes with INSERT INTO should also be fine, as long as you do not use INSERT OVERWRITE. However, there is no guarantee as to which partitions/files will be seen by COMPUTE STATS; it may see some but not all of the results from your INSERT.