Support Questions

Leopold · ‎03-11-2024

I have a parquet table named 'cats' holding 2 columns (day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number).

Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats?

All the following settings are true:
set hive.cbo.enable;
set hive.stats.autogather;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;

explain select max(number) from default.cats;

smruti · ‎03-12-2024

@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:

2024-03-12 10:48:37,247 INFO  org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.

But enabling "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.

View solution in original post

DianaTorres · ‎03-11-2024

@Leopold Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @mszurap @cravani who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Diana Torres,
Senior Community Moderator

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

smruti · ‎03-12-2024

@Leopold provided we have column stats available, Hive could use a fetch task to perform a simple aggregation task such as max(), instead of launching a Map job.

Try hive.fetch.task.aggr=true .

This property is disabled by default.

Leopold · ‎03-12-2024

@smruti Column stats are available indeed: {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"number\":\"true\"}}.
I extended the table size to 1gig+ (since hive.fetch.task.conversion.threshold=1073741824), and tried hive.fetch.task.aggr=true before doing the explain select for the max. It is unfortunately still doing a full TableScan as seen in the screenshot above. I would like to remark as well that we are talking about an external table here. When I put the same data in a managed table, the query plan makes use of stats as expected.
So in summary, I don't understand why stats arent being used for this external table.

smruti · ‎03-12-2024

@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:

2024-03-12 10:48:37,247 INFO  org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.

But enabling "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.

Leopold · ‎03-12-2024

Thank you @smruti. Is there no way to enable the use of statistics for external tables in Hive?

smruti · ‎03-12-2024

@Leopold It is disabled for external tables as data in HDFS can change without Hive knowing about it. Unfortunately I do not see a way to enforce fetch task for a query with an aggregate function.

Leopold · ‎03-13-2024

It looks indeed that Hive StatsOptimizer seems to exclude external tables from getting their queries answered entirely by stats. So this rules out usage of fetch stats for simple aggregation queries such as min, max, etc.
However, the optimizer can also use stats for query plan optimizing (e.g. join reordering) in complex queries. Are external tables excluded from these optimizations as well?

smruti · ‎03-14-2024

Hive does use stats from an external table in preparing query plan. When stats are accurate, it could estimate the size of intermediate data sets and select efficient join strategies. The only thing I noticed is the fetch task is not working.

Support Questions

Hive query plan not using generated statistics