Support Questions

Find answers, ask questions, and share your expertise

Hive Query Plan - Understand input and output of stage.

avatar
Rising Star

I am trying to understand the hive query plan for a simple distinct query and I have small confusion regrading output of one of the stage.

I have simple table with just two column, id and value. and just 4 rows as mentioned below.

Data:

<code>hive> select * from temp.test_distinct;
OK
1   100
2   100
3   100
4   150

Plan

<code>hive> explain select distinct value from temp.test_distinct;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_6]
            compressed:false
            Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [GBY_4]
            |  keys:KEY._col0 (type: string)
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_3]
                  key expressions:_col0 (type: string)
                  Map-reduce partition columns:_col0 (type: string)
                  sort order:+
                  Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator [GBY_2]
                     keys:value (type: string)
                     outputColumnNames:["_col0"]
                     Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                     Select Operator [SEL_1]
                        outputColumnNames:["value"]
                        Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                        TableScan [TS_0]
                           alias:test_distinct
                           Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE

Time taken: 0.181 seconds, Fetched: 35 row(s)

Confusion:

TableScan, Select Operator and Group By Operator shows that they processed 4 rows which make sense to me. But shouldn't the next stage after Group By Operator get only 2 rows the process. As group by will remove other rows.

In my DAG I can see the output of mapper is just two rows and not four, however that doesn't see to match with Plan.

May i looking it wrong?

1 REPLY 1

avatar
Expert Contributor

Statistics estimation is approximated based on available stats.

In your case, i see basic stats are collected, but column stats are not collected. To estimate the number of distinct values for a column, can you run

analyze table temp.test_distinct compute statistics for columns;

and check explain plan.