Support Questions

gaurang_n_shah · ‎10-01-2018

I am trying to understand the hive query plan for a simple distinct query and I have small confusion regrading output of one of the stage.

I have simple table with just two column, id and value. and just 4 rows as mentioned below.

Data:

<code>hive> select * from temp.test_distinct;
OK
1   100
2   100
3   100
4   150

Plan

<code>hive> explain select distinct value from temp.test_distinct;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_6]
            compressed:false
            Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [GBY_4]
            |  keys:KEY._col0 (type: string)
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_3]
                  key expressions:_col0 (type: string)
                  Map-reduce partition columns:_col0 (type: string)
                  sort order:+
                  Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator [GBY_2]
                     keys:value (type: string)
                     outputColumnNames:["_col0"]
                     Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                     Select Operator [SEL_1]
                        outputColumnNames:["value"]
                        Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                        TableScan [TS_0]
                           alias:test_distinct
                           Statistics:Num rows: 4 Data size: 20 Basic stats: COMPLETE Column stats: NONE

Time taken: 0.181 seconds, Fetched: 35 row(s)

Confusion:

TableScan, Select Operator and Group By Operator shows that they processed 4 rows which make sense to me. But shouldn't the next stage after Group By Operator get only 2 rows the process. As group by will remove other rows.

In my DAG I can see the output of mapper is just two rows and not four, however that doesn't see to match with Plan.

May i looking it wrong?

nramanaiah · ‎10-04-2018

Statistics estimation is approximated based on available stats.

In your case, i see basic stats are collected, but column stats are not collected. To estimate the number of distinct values for a column, can you run

analyze table temp.test_distinct compute statistics for columns;

and check explain plan.

Cloudera Community

Support Questions

Hive Query Plan - Understand input and output of stage.

Write Spark HQL Query output to HDFS

Understanding the mahout SSVD output!

Understanding Taxonomy in Apache Atlas

Understanding Kafka Consumer partition assignment ...

How to create excel file from Hive query output?

Hive insert query optimization

HIVE -K8S / failure to login: javax.security.auth...

Capture output from Hive action and use that as in...

HDFS Forecast & Capacity Planning

unable to write hive query output to s3