Created on 11-28-2024 09:06 AM - edited 11-28-2024 09:13 AM
I have create two partitioned tables with same ddl, except for that one is internal table , another is external table 。Both has same file location (oss://xxxx/xxx )。
create table test_analyze( id int, name varchar(20)) partitioned by (dt string) ;
create table test_analyze_external( id int, name varchar(20)) partitioned by (dt string) ;
msck repair table test_analyze;
analyze table test_analyze partition(dt='20241121') compute statistics;
msck repair table test_analyze_external;
analyze table test_analyze_external partition(dt='20241121') compute statistics;
when I query count() for table 'test_analyze', then quickly get the result , it seems fetch result from stats。 but when I send the same query to table 'test_analyze_external', it can't return stats result , and active a MR job。And it's sql-explain below:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: test_analyze_external |
| Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count() |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| Group By Operator |
| aggregations: count(_col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| ListSink |
| |
And I'm sure of having the config " set hive.compute.query.using.stats = true ",so, why???
Who can explain?
Created 11-28-2024 09:47 AM
@forhive Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Hive experts @ggangadharan @james_jones who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 11-29-2024 06:51 AM
many thanks !
Created 11-28-2024 11:47 PM
Created 11-29-2024 06:51 AM
Regarding the fact that both external tables and internal tables use the same file location, and please forgive my improper usage. But I'm just testing.
The part that confuses me is that after using the same MSCK REPAIR TABLE + ANALYZE TABLE commands on both tables, when I run the same COUNT() query, why does Hive apply query optimization for one but not for the other? What is the underlying principle behind Hive's internal design that leads to this distinction?