Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive query plan not using generated statistics

avatar
Explorer

screenshot.PNGI have a parquet table named 'cats' holding 2 columns (day STRING, number INT). Statistics are generated both for the table and the 2 columns (e.g. max, min, distinct can be seen in DESCRIBE FORMATTED default.cats number).

Still, when running an explain command for a select max, it looks like the statistics aren't being used. Why is this? How can Hive be configured to use the stats?

All the following settings are true:
set hive.cbo.enable;
set hive.stats.autogather;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;

explain select max(number) from default.cats;

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:

2024-03-12 10:48:37,247 INFO  org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.

But enabling  "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.

View solution in original post

8 REPLIES 8

avatar
Community Manager

@Leopold Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @mszurap @cravani  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Master Collaborator

@Leopold provided we have column stats available, Hive could use a fetch task to perform a simple aggregation task such as max(), instead of launching a Map job.

Try hive.fetch.task.aggr=true .

This property is disabled by default.

avatar
Explorer

@smruti Column stats are available indeed: {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"number\":\"true\"}}.
I extended the table size to 1gig+ (since hive.fetch.task.conversion.threshold=1073741824), and tried hive.fetch.task.aggr=true before doing the explain select for the max. It is unfortunately still doing a full TableScan as seen in the screenshot above. I would like to remark as well that we are talking about an external table here. When I put the same data in a managed table, the query plan makes use of stats as expected.
So in summary, I don't understand why stats arent being used for this external table.

 

avatar
Master Collaborator

@Leopold I just checked. Your observation is correct. For external tables, it does not use a fetch task. In the logs, I see the following message:

2024-03-12 10:48:37,247 INFO  org.apache.hadoop.hive.ql.optimizer.StatsOptimizer: [b226e7aa-9a42-4af3-b99b-be4a6592fb7f HiveServer2-Handler-Pool: Thread-31145]: Table t7 is external. Skip StatsOptimizer.

But enabling  "hive.fetch.task.aggr=true" will help avoid the Reducer phase that is used for final aggregation. It will be a Map-only job.

avatar
Explorer

Thank you @smruti. Is there no way to enable the use of statistics for external tables in Hive?

avatar
Master Collaborator

@Leopold It is disabled for external tables as data in HDFS can change without Hive knowing about it.  Unfortunately I do not see a way to enforce fetch task for a query with an aggregate function. 

avatar
Explorer

It looks indeed that Hive StatsOptimizer seems to exclude external tables from getting their queries answered entirely by stats. So this rules out usage of fetch stats for simple aggregation queries such as min, max, etc.
However, the optimizer can also use stats for query plan optimizing (e.g. join reordering) in complex queries. Are external tables excluded from these optimizations as well?

avatar
Master Collaborator

Hive does use stats from an external table in preparing query plan. When stats are accurate, it could estimate the size of intermediate data sets and select efficient join strategies. The only thing I noticed is the fetch task is not working.