Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Guru

Goal: Understand why statistics are useful in hive


Table with stats

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(*) from mytable;
+---------+--+
|   _c0   |
+---------+--+
| 843280  |
+---------+--+
1 row selected (0.332 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa>

Here the data is available in metastore so there is no need to launch map tasks to gather how many rows are there in the table for the query. Look-ups are faster as accessing data from metastore is faster any day compared to launching map tasks.

Without column statistics

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd;
INFO  : Session is already open
INFO  : Dag name: select count(col1) from abcd(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031)


+-------+--+
|  _c0  |
+-------+--+
| 1000  |
+-------+--+
1 row selected (2.109 seconds)  << Takes about 2 seconds

After update

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table abcd compute statistics for columns col1;
INFO  : Session is already open
INFO  : Dag name: analyze table abcd compute statistics...col1(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031)

No rows affected (2.61 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd;
+-------+--+
|  _c0  |
+-------+--+
| 1000  |
+-------+--+
1 row selected (0.344 seconds) <<< Runs within 1/3 of a second
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> 

When to run ANALYZE or gather statistics

If the variation in the data is too much, 30% or more (depends on what is acceptable based on runtimes), we can choose to run ANALYZE. In this example, the change in the dataset is almost 200%

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> insert into abcd select * from mytable where col1 > 5000 limit 2000;
INFO  : Session is already open
INFO  : Dag name: insert into abcd select * from mytabl...2000(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 16.49 s    
--------------------------------------------------------------------------------
INFO  : Loading data to table default.abcd from hdfs://xlautomation-1.h.c:8020/apps/hive/warehouse/abcd/.hive-staging_hive_2017-08-02_20-39-39_024_4775642732421672051-1/-ext-10000
INFO  : Table default.abcd stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640]
No rows affected (17.647 seconds)

Running the query post this variation in data, the runtimes are impacted for the same query which ran faster before

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd;
INFO  : Session is already open
INFO  : Dag name: select count(col1) from abcd(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031)
+-------+--+
|  _c0  |
+-------+--+
| 1000  |
+-------+--+
1 row selected (3.284 seconds) <<<<< Time increased

Lets update the statistics

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table abcd compute statistics for columns col1;
INFO  : Session is already open
INFO  : Dag name: analyze table abcd compute statistics...col1(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031)

No rows affected (3.374 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd;
+-------+--+
|  _c0  |
+-------+--+
| 1000  |
+-------+--+
1 row selected (0.346 seconds)  <<<<<< Back to almost quarter of a second to fetch the same data from metastore. 

Which column to pick

It is not necessary to gather statistics on all the columns, we can choose to consider the columns which are being used in queries. We can verify if we are collecting stats for a column even by looking at the explain plan

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> explain select count(col2) from abcd;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                           Explain                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Plan not optimized by CBO.                                                                                                                                                                                                  |
|                                                                                                                                                                                                                             |
| Vertex dependency in root stage                                                                                                                                                                                             |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                                                                                                                                                                                            |
|                                                                                                                                                                                                                             |
| Stage-0                                                                                                                                                                                                                     |
|    Fetch Operator                                                                                                                                                                                                           |
|       limit:-1                                                                                                                                                                                                              |
|       Stage-1                                                                                                                                                                                                               |
|          Reducer 2                                                                                                                                                                                                          |
|          File Output Operator [FS_192]                                                                                                                                                                                      |
|             compressed:false                                                                                                                                                                                                |
|             Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                    |
|             table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}  |
|             Group By Operator [GBY_190]                                                                                                                                                                                     |
|             |  aggregations:["count(VALUE._col0)"]                                                                                                                                                                          |
|             |  outputColumnNames:["_col0"]                                                                                                                                                                                  |
|             |  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                 |
|             |<-Map 1 [SIMPLE_EDGE]                                                                                                                                                                                          |
|                Reduce Output Operator [RS_189]                                                                                                                                                                              |
|                   sort order:                                                                                                                                                                                               |
|                   Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                              |
|                   value expressions:_col0 (type: bigint)                                                                                                                                                                    |
|                   Group By Operator [GBY_188]                                                                                                                                                                               |
|                      aggregations:["count(col2)"]                                                                                                                                                                           |
|                      outputColumnNames:["_col0"]                                                                                                                                                                            |
|                      Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                           |
|                      Select Operator [SEL_187]                                                                                                                                                                              |
|                         outputColumnNames:["col2"]                                                                                                                                                                          |
|                         Statistics:Num rows: 1000 Data size: 127640 Basic stats: COMPLETE Column stats: NONE                                                                                                                |
|                         TableScan [TS_186]                                                                                                                                                                                  |
|                            alias:abcd                                                                                                                                                                                       |
|                            Statistics:Num rows: 1000 Data size: 127640 Basic stats: COMPLETE Column stats: NONE                                                                                                             |
|                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
34 rows selected (0.384 seconds)

Once the stats are gathered, the plan is simplified:

0: jdbc:hive2://xlautomation-2.h.c:10000/defa> explain select count(col2) from abcd;
+-----------------------------+--+
|           Explain           |
+-----------------------------+--+
| Plan not optimized by CBO.  |
|                             |
| Stage-0                     |
|    Fetch Operator           |
|       limit:1               |
|                             |
+-----------------------------+--+
6 rows selected (0.306 seconds)

Considerations for Statistics

  • Is enabled by default, can be verified using
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> set hive.stats.autogather;
+------------------------------+--+
|             set              |
+------------------------------+--+
| hive.stats.autogather=true  |
+------------------------------+--+
1 row selected (0.03 seconds)
  • Stats can be manually gathered using ANALYZE for both table and column levels (one, more or all)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics;
INFO  : Session is already open
INFO  : Dag name: analyze table zzzz compute statistics(Stage-0)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033)

INFO  : Table default.zzzz stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640]
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.07 s     
--------------------------------------------------------------------------------
No rows affected (21.615 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics for columns;
INFO  : Session is already open
INFO  : Dag name: analyze table zzzz compute statist...columns(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033)

No rows affected (4.626 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics for columns col1;
INFO  : Session is already open
INFO  : Dag name: analyze table zzzz compute statistics...col1(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033)

No rows affected (3.299 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> 
  • Can be gathered for specific partitions and partition columns
ANALZYE TABLE zzz PARTITION (idate=2017-07-29) COMPUTE STATISTICS
  • Other parameters include NOSCAN/CACHE METADATA, where when NOSCAN is specified only the number of physical files and their bytes are gathered for statistics. CACHE METADATA is relevant when hbase is being used to store the temporary metadata.
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> ANALYZE TABLE zzzz compute statistics NOSCAN;
INFO  : Table default.zzzz stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640]
No rows affected (0.455 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> 

Reference

https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive

7,602 Views
Comments
avatar

Thank you very much for useful information.
Could I ask if autogather also updates column level stats?

avatar
Expert Contributor

Nice Article.