Member since
03-11-2024
4
Posts
3
Kudos Received
0
Solutions
03-13-2024
06:49 AM
1 Kudo
It looks indeed that Hive StatsOptimizer seems to exclude external tables from getting their queries answered entirely by stats. So this rules out usage of fetch stats for simple aggregation queries such as min, max, etc. However, the optimizer can also use stats for query plan optimizing (e.g. join reordering) in complex queries. Are external tables excluded from these optimizations as well?
... View more
03-12-2024
04:29 AM
1 Kudo
Thank you @smruti. Is there no way to enable the use of statistics for external tables in Hive?
... View more
03-12-2024
03:16 AM
1 Kudo
@smruti Column stats are available indeed: {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"number\":\"true\"}}. I extended the table size to 1gig+ (since hive.fetch.task.conversion.threshold=1073741824), and tried hive.fetch.task.aggr=true before doing the explain select for the max. It is unfortunately still doing a full TableScan as seen in the screenshot above. I would like to remark as well that we are talking about an external table here. When I put the same data in a managed table, the query plan makes use of stats as expected. So in summary, I don't understand why stats arent being used for this external table.
... View more
03-11-2024
06:18 AM
I have a parquet table named 'cats' holding 2 columns (day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number). Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats? All the following settings are true: set hive.cbo.enable; set hive.stats.autogather; set hive.compute.query.using.stats; set hive.stats.fetch.column.stats; set hive.stats.fetch.partition.stats; explain select max(number) from default.cats;
... View more
Labels:
- Labels:
-
Apache Hive