Support Questions
Find answers, ask questions, and share your expertise

Statistics in query explain is differ from actual data.

Explorer

Hello.

This is a follow up question of ( https://community.hortonworks.com/questions/39548/explanation-of-tez-task-counters.html )

When I run a explain for a query, statistics seems weird.

Due to the weird statistics, tez allocate only two reducers.

Explain

EXPLAIN SELECT user_no ,
         bn band_no ,
         pn post_no ,
         page_no ,
         count(1) pv_count ,
         ip is_preview FROM raw_scv_page_view_v2 
WHERE created_date = '20160612' 
AND user_no is not null GROUP BY user_no , bn , pn , page_no , ip;

Explain Result

Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2 vectorized
         File Output Operator [FS_2559]
            compressed:false
            Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [OP_2558]
               outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
               Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
               Group By Operator [OP_2557]
               |  aggregations:["count(VALUE._col0)"]
               |  keys:KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: int), KEY._col4 (type: string)
               |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
               |  Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
               |<-Map 1 [SIMPLE_EDGE]
                  Reduce Output Operator [RS_2552]
                     key expressions:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
                     Map-reduce partition columns:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
                     sort order:+++++
                     Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
                     value expressions:_col5 (type: bigint)
                     Group By Operator [GBY_2551]
                        aggregations:["count(1)"]
                        keys:user_no (type: bigint), bn (type: int), pn (type: bigint), page_no (type: int), ip (type: string)
                        outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
                        Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
                        Select Operator [SEL_2550]
                           outputColumnNames:["user_no","bn","pn","page_no","ip"]
                           Statistics:Num rows: 3142 Data size: 12568 Basic stats: COMPLETE Column stats: PARTIAL
                           Filter Operator [FIL_2556]
                              predicate:user_no is not null (type: boolean)
                              Statistics:Num rows: 3142 Data size: 12568 Basic stats: COMPLETE Column stats: PARTIAL
                              TableScan [TS_2548]
                                 alias:raw_scv_page_view_v2
                                 Statistics:Num rows: 6284 Data size: 756247 Basic stats: COMPLETE Column stats: PARTIAL 

* The last line tells the size of default partition data.( WHERE created_date = '20160612')

Statistics:Num rows: 6284 Data size: 756247

But it is tooooo small than the partition actually have.Where this value comes from? table statistics?

by this value, tez will will allocate only two reducers ( too small )

(according to https://community.hortonworks.com/articles/22419/hive-on-tez-performance-tuning-determining-reducer....

Current Reducer setting

SET hive.exec.reducers.bytes.per.reducer
>> 1143891558

Actual row count

SELECT count(*) from raw_scv_page_view_v2 WHERE created_date = '20160612';
>> 316,626,655

SELECT count(*) from raw_scv_page_view_v2 WHERE created_date = '20160612' and user_no is not null;
_c0
>> 316,262,138

Actual data size ( x GB files in total exists )

/ data/ flume/ scv/ raw_scv_page_view/ 2016/ 06/ 12/ 1
raw_scv_page_view_2016-06-12_00_xflume101.band.nhnsystem.com.1465657201940.avro	19.3 MB	irteam	irteam	-rw-r--r--	June 12, 2016 01:01 AM
.........

Query tasks ( only two reducer used.. )

INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: SELECT user_no ,
bn band_no ,
...ip(Stage-1)
INFO : 
INFO : Status: Running (Executing on YARN cluster with App id application_1465874164014_0281)
INFO : Map 1: -/-	Reducer 2: 0/2	
INFO : Map 1: 0/68	Reducer 2: 0/2	
INFO : Map 1: 0(+3)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+5)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+8)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+29)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+58)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+62)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+65)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+66)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2	
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2	
INFO : Map 1: 1(+67)/68	Reducer 2: 0/2	
INFO : Map 1: 1(+67)/68	Reducer 2: 0/2	
...
INFO : Map 1: 68/68	Reducer 2: 0(+2)/2	
INFO : Map 1: 68/68	Reducer 2: 0(+2)/2	
INFO : Map 1: 68/68	Reducer 2: 1(+1)/2	
INFO : Map 1: 68/68	Reducer 2: 2/2	

Table schema

show create table X
CREATE EXTERNAL TABLE `X`(......)
PARTITIONED BY ( 
  `created_date` string, 
  `page_no` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://....../raw_scv_page_view'
TBLPROPERTIES (
  'avro.schema.url'='hdfs://......avsc', 
  'transient_lastDdlTime'='1441181032')

This table is external. and data written by flume stream in avro format.

Describe extended. Partition information

describe extended raw_scv_page_view_v2  partition(created_date='20160612',page_no=1)
Detailed Partition Information	Partition(values:[20160612, 1]
 dbName:default
 tableName:raw_scv_page_view_v2
 createTime:1465657597
 lastAccessTime:0
 sd:StorageDescriptor(cols:[]
 location:hdfs://....ew/2016/06/12/1
 inputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
 outputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
 compressed:false
 numBuckets:-1
 serdeInfo:SerDeInfo(name:null
 serializationLib:org.apache.hadoop.hive.serde2.avro.AvroSerDe
 parameters:{serialization.format=1})
 bucketCols:[]
 sortCols:[]
 parameters:{}
 skewedInfo:SkewedInfo(skewedColNames:[]
 skewedColValues:[]
 skewedColValueLocationMaps:{})
 storedAsSubDirectories:false)
 parameters:{totalSize=30146
 numRows=-1
 rawDataSize=-1
 COLUMN_STATS_ACCURATE=false
 numFiles=6
 transient_lastDdlTime=1465657597})	

* the rawDataSize and numRows of describe are related to this problem..?

I know this might not enough detail question. but could please give any advice..

Why explain show too small estimates and tez lead to allocate small reducers?

3 REPLIES 3

Re: Statistics in query explain is differ from actual data.

Explorer

Update.

I suspected the table statistics, so analyzed all partitions again.

analyze table raw_scv_page_view_v2  partition(created_date='20160612',page_no=...) compute statistics noscan;

Then ran the query again.

The initial num rows is increased, but the last operation's statistics still looks wrong.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2 vectorized
         File Output Operator [FS_11]
            compressed:false
            Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [OP_10]
               outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
               Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
               Group By Operator [OP_9]
               |  aggregations:["count(VALUE._col0)"]
               |  keys:KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: int), KEY._col4 (type: string)
               |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
 
               // STILL too low value..
               |  Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
               |<-Map 1 [SIMPLE_EDGE]
                  Reduce Output Operator [RS_4]
                     key expressions:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
                     Map-reduce partition columns:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
                     sort order:+++++
                     Statistics:Num rows: 1098 Data size: 13176 Basic stats: COMPLETE Column stats: PARTIAL
                     value expressions:_col5 (type: bigint)
                     Group By Operator [GBY_3]
                        aggregations:["count(1)"]
                        keys:user_no (type: bigint), bn (type: int), pn (type: bigint), page_no (type: int), ip (type: string)
                        outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
                        Statistics:Num rows: 1098 Data size: 13176 Basic stats: COMPLETE Column stats: PARTIAL
                        Select Operator [SEL_2]
                           outputColumnNames:["user_no","bn","pn","page_no","ip"]
                           Statistics:Num rows: 64375934 Data size: 257503736 Basic stats: COMPLETE Column stats: PARTIAL
                           Filter Operator [FIL_8]
                              predicate:user_no is not null (type: boolean)
                              Statistics:Num rows: 64375934 Data size: 257503736 Basic stats: COMPLETE Column stats: PARTIAL
                              TableScan [TS_0]
                                 alias:raw_scv_page_view_v2
                                 // INCREASED !!
                                 Statistics:Num rows: 128751868 Data size: 15450226170 Basic stats: COMPLETE Column stats: PARTIAL

Re: Statistics in query explain is differ from actual data.

Super Guru

Any problems with the types? Everything look okay. Any chance to move data to Orc?

Re: Statistics in query explain is differ from actual data.

Explorer

It was due to the configuration "hive.stats.fetch.column.stats=true", but column statistics was not updated to metastore.

when after set hive.stats.fetch.column.stats to false, reducer task count become make sense.