Support Questions

Find answers, ask questions, and share your expertise

Why 'count()' query of hive external table has different performance compared to internal table ?

avatar
New Contributor

 I have create two partitioned tables  with same ddl,  except for that one is internal table  , another is external table 。Both has same file location (oss://xxxx/xxx )。

 

 

create table test_analyze( id int, name varchar(20)) partitioned by (dt string) ; 
create table test_analyze_external( id int, name varchar(20)) partitioned by (dt string) ;


msck repair table test_analyze;
analyze table  test_analyze partition(dt='20241121') compute statistics; 

msck repair table test_analyze_external;
analyze table  test_analyze_external partition(dt='20241121') compute statistics; 

 

 

when I query  count() for table 'test_analyze', then quickly get  the result ,  it seems fetch result from stats。 but when I send the same query to table 'test_analyze_external',  it can't return stats result , and active a MR job。And  it's  sql-explain below:

 

 

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: test_analyze_external           |
|             Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count()              |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         Group By Operator                          |
|           aggregations: count(_col0)               |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           ListSink                                 |
|                                                    |

 

 

And  I'm sure  of having the config " set hive.compute.query.using.stats = true  ",so, why???

Who can explain?

4 REPLIES 4

avatar
Community Manager

@forhive Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @ggangadharan @james_jones  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
New Contributor

many thanks  !

avatar
Master Collaborator
  • First of all , It is not recommended to use the same location for both internal and external tables.
  • Internal tables in Hive are native tables that are fully controlled by Hive itself.
  • External tables, on the other hand, can be accessed by other components such as Spark, Impala, and File system operations,.....etc.
  • Since External tables are used by other components, their corresponding locations need to be relied upon.
  • To read the files and obtain the count, Hive launches a MapReduce job for external tables.
  • It is recommended to use Managed tables if other components are not utilizing the corresponding table. 

avatar
New Contributor

Regarding the fact that both external tables and internal tables use the same file location,  and please forgive my improper usage.  But I'm just testing.

The part that confuses me is that after using the same MSCK REPAIR TABLE + ANALYZE TABLE commands on both tables, when I run the same COUNT() query, why does Hive apply query optimization for one but not for the other? What is the underlying principle behind Hive's internal design that leads to this distinction?