Member since
11-28-2024
3
Posts
1
Kudos Received
0
Solutions
11-29-2024
06:51 AM
many thanks !
... View more
11-29-2024
06:51 AM
Regarding the fact that both external tables and internal tables use the same file location, and please forgive my improper usage. But I'm just testing. The part that confuses me is that after using the same MSCK REPAIR TABLE + ANALYZE TABLE commands on both tables, when I run the same COUNT() query, why does Hive apply query optimization for one but not for the other? What is the underlying principle behind Hive's internal design that leads to this distinction?
... View more
11-28-2024
09:06 AM
1 Kudo
I have create two partitioned tables with same ddl, except for that one is internal table , another is external table 。Both has same file location (oss://xxxx/xxx )。 create table test_analyze( id int, name varchar(20)) partitioned by (dt string) ;
create table test_analyze_external( id int, name varchar(20)) partitioned by (dt string) ;
msck repair table test_analyze;
analyze table test_analyze partition(dt='20241121') compute statistics;
msck repair table test_analyze_external;
analyze table test_analyze_external partition(dt='20241121') compute statistics; when I query count() for table 'test_analyze', then quickly get the result , it seems fetch result from stats。 but when I send the same query to table 'test_analyze_external', it can't return stats result , and active a MR job。And it's sql-explain below: +----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: test_analyze_external |
| Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count() |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| Group By Operator |
| aggregations: count(_col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| ListSink |
| | And I'm sure of having the config " set hive.compute.query.using.stats = true ",so, why??? Who can explain?
... View more
Labels:
- Labels:
-
Apache Hive