Support Questions

# Statistics in query explain differ from actual data

Explorer

Hello.

When I run EXPLAIN for a query, the statistics look weird.

Because of these weird statistics, Tez allocates only two reducers.

Explain

```
EXPLAIN
SELECT user_no,
       bn band_no,
       pn post_no,
       page_no,
       count(1) pv_count,
       ip is_preview
FROM raw_scv_page_view_v2
WHERE created_date = '20160612'
  AND user_no IS NOT NULL
GROUP BY user_no, bn, pn, page_no, ip;
```

Explain Result

```
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2 vectorized
File Output Operator [FS_2559]
compressed:false
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator [OP_2558]
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
Group By Operator [OP_2557]
|  aggregations:["count(VALUE._col0)"]
|  keys:KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: int), KEY._col4 (type: string)
|  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
|  Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_2552]
key expressions:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
Map-reduce partition columns:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
sort order:+++++
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
value expressions:_col5 (type: bigint)
Group By Operator [GBY_2551]
aggregations:["count(1)"]
keys:user_no (type: bigint), bn (type: int), pn (type: bigint), page_no (type: int), ip (type: string)
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator [SEL_2550]
outputColumnNames:["user_no","bn","pn","page_no","ip"]
Statistics:Num rows: 3142 Data size: 12568 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator [FIL_2556]
predicate:user_no is not null (type: boolean)
Statistics:Num rows: 3142 Data size: 12568 Basic stats: COMPLETE Column stats: PARTIAL
TableScan [TS_2548]
alias:raw_scv_page_view_v2
Statistics:Num rows: 6284 Data size: 756247 Basic stats: COMPLETE Column stats: PARTIAL
```

* The last line shows the statistics for the selected partition (WHERE created_date = '20160612'):

Statistics:Num rows: 6284 Data size: 756247

But this is far smaller than what the partition actually contains. Where does this value come from? Table statistics?

Based on this value, Tez allocates only two reducers (too few).

Current Reducer setting

```
SET hive.exec.reducers.bytes.per.reducer;
>> 1143891558
```
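If I understand the size-based heuristic correctly, the reducer count is roughly the estimated input size to the reducer divided by `hive.exec.reducers.bytes.per.reducer`, clamped to `[1, hive.exec.reducers.max]`. A minimal sketch (simplified; Hive's actual `estimateNumberOfReducers` logic considers more settings, and the `max_reducers` default of 1009 is an assumption):

```python
import math

def estimate_reducers(estimated_bytes, bytes_per_reducer, max_reducers=1009):
    """Simplified sketch of a size-based reducer estimate:
    ceil(input size / bytes per reducer), clamped to [1, max]."""
    reducers = math.ceil(estimated_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))

# With the planner's tiny estimate (216 bytes), the count collapses:
print(estimate_reducers(216, 1143891558))           # -> 1
# With a realistic input estimate (~15 GB), it would be much larger:
print(estimate_reducers(15450226170, 1143891558))   # -> 14
```

This would explain why a wrong row/size estimate directly starves the query of reducers.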

Actual row count

```
SELECT count(*) from raw_scv_page_view_v2 WHERE created_date = '20160612';
>> 316,626,655

SELECT count(*) from raw_scv_page_view_v2 WHERE created_date = '20160612' and user_no is not null;
_c0
>> 316,262,138
```

Actual data size (x GB of files in total)

```
/data/flume/scv/raw_scv_page_view/2016/06/12/1
raw_scv_page_view_2016-06-12_00_xflume101.band.nhnsystem.com.1465657201940.avro	19.3 MB	irteam	irteam	-rw-r--r--	June 12, 2016 01:01 AM
.........
```

Query tasks (only two reducers used)

```
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: SELECT user_no ,
bn band_no ,
...ip(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1465874164014_0281)
INFO : Map 1: -/-	Reducer 2: 0/2
INFO : Map 1: 0/68	Reducer 2: 0/2
INFO : Map 1: 0(+3)/68	Reducer 2: 0/2
INFO : Map 1: 0(+5)/68	Reducer 2: 0/2
INFO : Map 1: 0(+8)/68	Reducer 2: 0/2
INFO : Map 1: 0(+29)/68	Reducer 2: 0/2
INFO : Map 1: 0(+58)/68	Reducer 2: 0/2
INFO : Map 1: 0(+62)/68	Reducer 2: 0/2
INFO : Map 1: 0(+65)/68	Reducer 2: 0/2
INFO : Map 1: 0(+66)/68	Reducer 2: 0/2
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2
INFO : Map 1: 0(+68)/68	Reducer 2: 0/2
INFO : Map 1: 1(+67)/68	Reducer 2: 0/2
INFO : Map 1: 1(+67)/68	Reducer 2: 0/2
...
INFO : Map 1: 68/68	Reducer 2: 0(+2)/2
INFO : Map 1: 68/68	Reducer 2: 0(+2)/2
INFO : Map 1: 68/68	Reducer 2: 1(+1)/2
INFO : Map 1: 68/68	Reducer 2: 2/2
```

Table schema

```
show create table X
CREATE EXTERNAL TABLE `X`(......)
PARTITIONED BY (
`created_date` string,
`page_no` int)
ROW FORMAT SERDE
STORED AS INPUTFORMAT
OUTPUTFORMAT
LOCATION
'hdfs://....../raw_scv_page_view'
TBLPROPERTIES (
'avro.schema.url'='hdfs://......avsc',
'transient_lastDdlTime'='1441181032')
```

This table is external, and the data is written by a Flume stream in Avro format.

Describe extended. Partition information

```
describe extended raw_scv_page_view_v2 partition(created_date='20160612',page_no=1)
Detailed Partition Information	Partition(values:[20160612, 1]
dbName:default
tableName:raw_scv_page_view_v2
createTime:1465657597
lastAccessTime:0
sd:StorageDescriptor(cols:[]
location:hdfs://....ew/2016/06/12/1
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(name:null
parameters:{serialization.format=1})
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false)
parameters:{totalSize=30146
numRows=-1
rawDataSize=-1
COLUMN_STATS_ACCURATE=false
numFiles=6
transient_lastDdlTime=1465657597})
```

* Are the rawDataSize and numRows shown by describe extended related to this problem?

I know this question might not have enough detail, but could you please give any advice?

Why does explain show such small estimates, leading Tez to allocate so few reducers?

3 REPLIES

# Re: Statistics in query explain differ from actual data

Explorer

Update.

I suspected the table statistics, so I analyzed all the partitions again.

```
analyze table raw_scv_page_view_v2 partition(created_date='20160612',page_no=...) compute statistics noscan;
```
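As far as I understand, `noscan` gathers only file-level statistics (numFiles, totalSize); populating numRows and rawDataSize requires a full scan. A sketch of what that would look like (syntax may vary by Hive version):

```sql
-- Full scan over the partitions: fills in numRows / rawDataSize
analyze table raw_scv_page_view_v2
  partition(created_date='20160612', page_no)
  compute statistics;
```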

Then I ran the query again.

The initial row count increased, but the later operators' statistics still look wrong.

```
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2 vectorized
File Output Operator [FS_11]
compressed:false
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator [OP_10]
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
Group By Operator [OP_9]
|  aggregations:["count(VALUE._col0)"]
|  keys:KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: int), KEY._col4 (type: string)
|  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]

// STILL too low value..
|  Statistics:Num rows: 18 Data size: 216 Basic stats: COMPLETE Column stats: PARTIAL
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_4]
key expressions:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
Map-reduce partition columns:_col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: int), _col4 (type: string)
sort order:+++++
Statistics:Num rows: 1098 Data size: 13176 Basic stats: COMPLETE Column stats: PARTIAL
value expressions:_col5 (type: bigint)
Group By Operator [GBY_3]
aggregations:["count(1)"]
keys:user_no (type: bigint), bn (type: int), pn (type: bigint), page_no (type: int), ip (type: string)
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5"]
Statistics:Num rows: 1098 Data size: 13176 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator [SEL_2]
outputColumnNames:["user_no","bn","pn","page_no","ip"]
Statistics:Num rows: 64375934 Data size: 257503736 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator [FIL_8]
predicate:user_no is not null (type: boolean)
Statistics:Num rows: 64375934 Data size: 257503736 Basic stats: COMPLETE Column stats: PARTIAL
TableScan [TS_0]
alias:raw_scv_page_view_v2
// INCREASED !!
Statistics:Num rows: 128751868 Data size: 15450226170 Basic stats: COMPLETE Column stats: PARTIAL
```
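If I read the plan right, the 18-row estimate comes from the group-by cardinality estimate, which (roughly) multiplies the distinct-value counts (NDVs) of the key columns and caps the product at the input row count. A sketch, where the NDV values are hypothetical and chosen only to reproduce the 18-row figure (Hive's real `StatsRulesProcFactory` logic applies more damping):

```python
def estimate_groupby_rows(input_rows, key_ndvs):
    """Sketch of a group-by cardinality estimate: product of the key
    columns' distinct-value counts, capped at the input row count."""
    product = 1
    for ndv in key_ndvs:
        product *= ndv
    return min(product, input_rows)

# Stale column stats with tiny NDVs yield a tiny estimate no matter
# how large the input is (3 * 2 * 3 * 1 * 1 = 18):
print(estimate_groupby_rows(64375934, [3, 2, 3, 1, 1]))  # -> 18
```

That would explain why re-running ANALYZE with `noscan` fixed the TableScan estimate but not the group-by estimate: the stale *column* statistics are still being used.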

# Re: Statistics in query explain differ from actual data

Super Guru

Any problems with the types? Everything looks okay. Any chance to move the data to ORC?

# Re: Statistics in query explain differ from actual data

Explorer

It was due to the configuration `hive.stats.fetch.column.stats=true`, while the column statistics had never been updated in the metastore.

After setting hive.stats.fetch.column.stats to false, the reducer task count makes sense.
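For anyone hitting the same issue, the two options would be (a sketch; syntax may vary by Hive version):

```sql
-- Option 1: stop the planner from using the stale column stats
SET hive.stats.fetch.column.stats=false;

-- Option 2: refresh the column stats so the estimates become accurate
ANALYZE TABLE raw_scv_page_view_v2
  PARTITION (created_date='20160612', page_no)
  COMPUTE STATISTICS FOR COLUMNS;
```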
