Created on 03-11-2024 06:18 AM - edited 03-11-2024 06:22 AM
I have a parquet table named 'cats' holding 2 columns (day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number).
Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats?
All the following settings are true:
set hive.cbo.enable;
set hive.stats.autogather;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;
explain select max(number) from default.cats;
Created 03-12-2024 04:00 AM
@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:
2024-03-12 10:48:37,247 INFO org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.
But enabling "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.
Created 03-11-2024 01:54 PM
@Leopold Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Hive experts @mszurap @cravani who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 03-12-2024 02:38 AM
@Leopold provided we have column stats available, Hive could use a fetch task to perform a simple aggregation task such as max(), instead of launching a Map job.
Try hive.fetch.task.aggr=true .
This property is disabled by default.
Created on 03-12-2024 03:16 AM - edited 03-12-2024 03:18 AM
@smruti Column stats are available indeed: {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"number\":\"true\"}}.
I extended the table size to 1gig+ (since hive.fetch.task.conversion.threshold=1073741824), and tried hive.fetch.task.aggr=true before doing the explain select for the max. It is unfortunately still doing a full TableScan as seen in the screenshot above. I would like to remark as well that we are talking about an external table here. When I put the same data in a managed table, the query plan makes use of stats as expected.
So in summary, I don't understand why stats arent being used for this external table.
Created 03-12-2024 04:00 AM
@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:
2024-03-12 10:48:37,247 INFO org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.
But enabling "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.
Created 03-12-2024 04:29 AM
Thank you @smruti. Is there no way to enable the use of statistics for external tables in Hive?
Created 03-12-2024 07:46 AM
@Leopold It is disabled for external tables as data in HDFS can change without Hive knowing about it. Unfortunately I do not see a way to enforce fetch task for a query with an aggregate function.
Created 03-13-2024 06:49 AM
It looks indeed that Hive StatsOptimizer seems to exclude external tables from getting their queries answered entirely by stats. So this rules out usage of fetch stats for simple aggregation queries such as min, max, etc.
However, the optimizer can also use stats for query plan optimizing (e.g. join reordering) in complex queries. Are external tables excluded from these optimizations as well?
Created 03-14-2024 11:58 PM
Hive does use stats from an external table in preparing query plan. When stats are accurate, it could estimate the size of intermediate data sets and select efficient join strategies. The only thing I noticed is the fetch task is not working.