About forhive

forhive · ‎11-29-2024

many thanks !

forhive · ‎11-29-2024

Regarding the fact that both external tables and internal tables use the same file location, and please forgive my improper usage. But I'm just testing. The part that confuses me is that after using the same MSCK REPAIR TABLE + ANALYZE TABLE commands on both tables, when I run the same COUNT() query, why does Hive apply query optimization for one but not for the other? What is the underlying principle behind Hive's internal design that leads to this distinction?

forhive · ‎11-28-2024

I have create two partitioned tables with same ddl, except for that one is internal table , another is external table 。Both has same file location (oss://xxxx/xxx )。 create table test_analyze( id int, name varchar(20)) partitioned by (dt string) ; create table test_analyze_external( id int, name varchar(20)) partitioned by (dt string) ; msck repair table test_analyze; analyze table test_analyze partition(dt='20241121') compute statistics; msck repair table test_analyze_external; analyze table test_analyze_external partition(dt='20241121') compute statistics; when I query count() for table 'test_analyze'， then quickly get the result , it seems fetch result from stats。 but when I send the same query to table 'test_analyze_external', it can't return stats result , and active a MR job。And it's sql-explain below： +----------------------------------------------------+ | Explain | +----------------------------------------------------+ | STAGE DEPENDENCIES: | | Stage-1 is a root stage | | Stage-0 depends on stages: Stage-1 | | | | STAGE PLANS: | | Stage: Stage-1 | | Map Reduce | | Map Operator Tree: | | TableScan | | alias: test_analyze_external | | Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE | | Select Operator | | Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE | | Group By Operator | | aggregations: count() | | mode: hash | | outputColumnNames: _col0 | | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | File Output Operator | | compressed: false | | table: | | input format: org.apache.hadoop.mapred.SequenceFileInputFormat | | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat | | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe | | | | Stage: Stage-0 | | Fetch Operator | | limit: -1 | | Processor Tree: | | Group By Operator | | aggregations: count(_col0) | | mode: mergepartial | | outputColumnNames: _col0 | | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | ListSink | | | And I'm sure of having the config " set hive.compute.query.using.stats = true "，so, why??? Who can explain？

Online	Offline
Last Visited	‎12-01-2024 02:34 PM

Member Since	‎11-28-2024 08:44 AM
Last Visited	‎12-01-2024 02:34 PM
Posts	3
Kudos received	1

Cloudera Community

Re: Why 'count()' query of hive external table ...

Re: Why 'count()' query of hive external table ...

Why 'count()' query of hive external table has ...