Created on 08-02-2017 10:14 PM
Goal: Understand why statistics are useful in hive
Table with stats
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(*) from mytable; +---------+--+ | _c0 | +---------+--+ | 843280 | +---------+--+ 1 row selected (0.332 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa>
Here the data is available in metastore so there is no need to launch map tasks to gather how many rows are there in the table for the query. Look-ups are faster as accessing data from metastore is faster any day compared to launching map tasks.
Without column statistics
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd; INFO : Session is already open INFO : Dag name: select count(col1) from abcd(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031) +-------+--+ | _c0 | +-------+--+ | 1000 | +-------+--+ 1 row selected (2.109 seconds) << Takes about 2 seconds
After update
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table abcd compute statistics for columns col1; INFO : Session is already open INFO : Dag name: analyze table abcd compute statistics...col1(Stage-0) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031) No rows affected (2.61 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd; +-------+--+ | _c0 | +-------+--+ | 1000 | +-------+--+ 1 row selected (0.344 seconds) <<< Runs within 1/3 of a second 0: jdbc:hive2://xlautomation-2.h.c:10000/defa>
When to run ANALYZE or gather statistics
If the variation in the data is too much, 30% or more (depends on what is acceptable based on runtimes), we can choose to run ANALYZE. In this example, the change in the dataset is almost 200%
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> insert into abcd select * from mytable where col1 > 5000 limit 2000; INFO : Session is already open INFO : Dag name: insert into abcd select * from mytabl...2000(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 6 6 0 0 0 0 Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 16.49 s -------------------------------------------------------------------------------- INFO : Loading data to table default.abcd from hdfs://xlautomation-1.h.c:8020/apps/hive/warehouse/abcd/.hive-staging_hive_2017-08-02_20-39-39_024_4775642732421672051-1/-ext-10000 INFO : Table default.abcd stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640] No rows affected (17.647 seconds)
Running the query post this variation in data, the runtimes are impacted for the same query which ran faster before
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd; INFO : Session is already open INFO : Dag name: select count(col1) from abcd(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031) +-------+--+ | _c0 | +-------+--+ | 1000 | +-------+--+ 1 row selected (3.284 seconds) <<<<< Time increased
Lets update the statistics
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table abcd compute statistics for columns col1; INFO : Session is already open INFO : Dag name: analyze table abcd compute statistics...col1(Stage-0) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0031) No rows affected (3.374 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa> select count(col1) from abcd; +-------+--+ | _c0 | +-------+--+ | 1000 | +-------+--+ 1 row selected (0.346 seconds) <<<<<< Back to almost quarter of a second to fetch the same data from metastore.
Which column to pick
It is not necessary to gather statistics on all the columns, we can choose to consider the columns which are being used in queries. We can verify if we are collecting stats for a column even by looking at the explain plan
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> explain select count(col2) from abcd; +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | Explain | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | Plan not optimized by CBO. | | | | Vertex dependency in root stage | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Stage-1 | | Reducer 2 | | File Output Operator [FS_192] | | compressed:false | | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} | | Group By Operator [GBY_190] | | | aggregations:["count(VALUE._col0)"] | | | outputColumnNames:["_col0"] | | | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | |<-Map 1 [SIMPLE_EDGE] | | Reduce Output Operator [RS_189] | | sort order: | | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | value expressions:_col0 (type: bigint) | | Group By Operator [GBY_188] | | aggregations:["count(col2)"] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE | | Select Operator [SEL_187] | | outputColumnNames:["col2"] | | Statistics:Num rows: 1000 Data size: 127640 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_186] | | alias:abcd | | Statistics:Num rows: 1000 Data size: 127640 Basic stats: COMPLETE Column stats: NONE | | | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ 34 rows selected (0.384 seconds)
Once the stats are gathered, the plan is simplified:
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> explain select count(col2) from abcd; +-----------------------------+--+ | Explain | +-----------------------------+--+ | Plan not optimized by CBO. | | | | Stage-0 | | Fetch Operator | | limit:1 | | | +-----------------------------+--+ 6 rows selected (0.306 seconds)
Considerations for Statistics
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> set hive.stats.autogather; +------------------------------+--+ | set | +------------------------------+--+ | hive.stats.autogather=true | +------------------------------+--+ 1 row selected (0.03 seconds)
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics; INFO : Session is already open INFO : Dag name: analyze table zzzz compute statistics(Stage-0) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033) INFO : Table default.zzzz stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640] -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 1 1 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 2.07 s -------------------------------------------------------------------------------- No rows affected (21.615 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics for columns; INFO : Session is already open INFO : Dag name: analyze table zzzz compute statist...columns(Stage-0) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033) No rows affected (4.626 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa> analyze table zzzz compute statistics for columns col1; INFO : Session is already open INFO : Dag name: analyze table zzzz compute statistics...col1(Stage-0) INFO : Status: Running (Executing on YARN cluster with App id application_1499274604190_0033) No rows affected (3.299 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa>
ANALZYE TABLE zzz PARTITION (idate=2017-07-29) COMPUTE STATISTICS
0: jdbc:hive2://xlautomation-2.h.c:10000/defa> ANALYZE TABLE zzzz compute statistics NOSCAN; INFO : Table default.zzzz stats: [numFiles=1, numRows=1000, totalSize=128640, rawDataSize=127640] No rows affected (0.455 seconds) 0: jdbc:hive2://xlautomation-2.h.c:10000/defa>
Reference
https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive
Created on 08-04-2017 01:23 AM
Thank you very much for useful information.
Could I ask if autogather also updates column level stats?
Created on 06-24-2018 06:44 PM
Nice Article.