- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive Query Plan - Understand input and output of stage.
- Labels:
-
Apache Hadoop
-
Apache Hive
Created ‎10-01-2018 07:30 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am trying to understand the hive query plan for a simple distinct
query and I have small confusion regrading output of one of the stage.
I have simple table with just two column, id and value. and just 4 rows as mentioned below.
Data:
<code>hive> select * from temp.test_distinct; OK 1 100 2 100 3 100 4 150
Plan
<code>hive> explain select distinct value from temp.test_distinct; OK Plan not optimized by CBO. Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_6] compressed:false Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [GBY_4] | keys:KEY._col0 (type: string) | outputColumnNames:["_col0"] | Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE |<-Map 1 [SIMPLE_EDGE] Reduce Output Operator [RS_3] key expressions:_col0 (type: string) Map-reduce partition columns:_col0 (type: string) sort order:+ Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE Group By Operator [GBY_2] keys:value (type: string) outputColumnNames:["_col0"] Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE Select Operator [SEL_1] outputColumnNames:["value"] Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE TableScan [TS_0] alias:test_distinct Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE Time taken: 0.181 seconds, Fetched: 35 row(s)
Confusion:
TableScan, Select Operator and Group By Operator shows that they processed 4 rows which make sense to me. But shouldn't the next stage after Group By Operator get only 2 rows the process. As group by will remove other rows.
In my DAG I can see the output of mapper is just two rows and not four, however that doesn't see to match with Plan.
May i looking it wrong?
Created ‎10-04-2018 11:54 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Statistics estimation is approximated based on available stats.
In your case, i see basic stats are collected, but column stats are not collected. To estimate the number of distinct values for a column, can you run
analyze table temp.test_distinct compute statistics for columns;
and check explain plan.
