Created 01-17-2017 03:40 AM
Hi,
Could someone please explain me understand the below questions on hive partition pruning and explain plan?
1. How to check if partition pruning occurs by checking the explain plan? I thought I would see "Dynamic partitioning event operator" in explain plan, but in my sample query below I am not seeing any such operator. I enabled hive.tez.dynamic.partition.pruning.
-- Since the table does not have much data, it is going for map join, does that have anything to do with partition pruning not happening?
explain select a.* from big_part a, small_np b where a.jdate = b.jdate ; big_part is partitioned on jdate where small_np is a non partitioned table, even adding explicit filter on jdate like jdate = "2017-01-01" is not showing this operator in explain plan. The tables are just in text formats.
I tried disabling and enabling hive.optimize.ppd . But it just changed adding or removing a filter operator much higher in explain plan, but no difference besides that. Will the optimize.ppd parameter have any effect on partition pruning?
2. Is it correct to expect that dynamic partition pruning should happen on big_part table in the above query?
3. If both the tables used in join are partitioned, can we expect that dynamic partition pruning happens on both tables?
4. Will the dynamic partition pruning occur in case of outer joins too? (full and left outer assuming that inner table's conditions are given in "on condition" or outer table's conditions are given in "where clause").
5. What exactly this hive.optimize.ppd do in case of text files? Just push the filter predicates when reading from table itself if possible?
Thank you!
Created 01-17-2017 04:41 AM
1. You are correct that you should see something like "Dynamic partitioning event operator" in the explain plan. From some of the examples in the Hive tests dynamic partition pruning can be used even in the case of mapjoin getting selected.
2. It does look like dynamic partition pruning should occur on those tables .. What version of Hive? Can you show the explain plan and DESCRIBE TABLE on the tables involved? Can you try setting hive.explain.user=false?
3. If both tables are partitioned tables, dynamic partition pruning should only happen on one side
4. Not totally sure about the outer join case - might only work in the case that the non-partitioned table was the outer.
5. I don't think text/non-text matters for hive.optimize.ppd - I believe this has to do with predicate pushdown by the optimizer. If you're talking about pushdown of predicates into the storage level (like for ORC), it looks like that setting is controlled by hive.optimize.index.filter.
Created 01-17-2017 08:10 PM
Thanks for your comments!
Here is the explain plan and create table statements. Hive version is 0.14.
And also for the 3rd answer, in case both are partitioned tables, is there anyway to ensure that bigger of the two tables undergo partition pruning instead of the small one? (or is this the default behavior?)
What does hive.explain.user = false do? I have attached explain plan by both enabling and disabling this.
> show create table big_part; OK CREATE TABLE `big_part`( `id` int) PARTITIONED BY ( `jdate` date) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://littleredns/apps/hive/warehouse/big_part' TBLPROPERTIES ( 'transient_lastDdlTime'='1484615054') Time taken: 1.749 seconds, Fetched: 14 row(s) hive> show create table small_np; OK CREATE TABLE `small_np`( `id2` int, `jdate` date) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://littleredns/apps/hive/warehouse/small_np' TBLPROPERTIES ( 'transient_lastDdlTime'='1484615162') Time taken: 0.16 seconds, Fetched: 13 row(s) hive> set hive.optimize.ppd=true; hive> set hive.tez.dynamic.partition.pruning=true; hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate; OK Plan not optimized by CBO. Vertex dependency in root stage Map 1 <- Map 2 (BROADCAST_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Map 1 vectorized File Output Operator [FS_21] compressed:false Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Select Operator [OP_20] outputColumnNames:["_col0","_col1"] Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator [FIL_19] predicate:(_col1 = _col6) (type: boolean) Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Map Join Operator [MAPJOIN_18] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 2":"jdate (type: date)","Map 1":"jdate (type: date)"} | outputColumnNames:["_col0","_col1","_col6"] | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE |<-Map 2 [BROADCAST_EDGE] | Reduce Output Operator [RS_4] | key expressions:jdate (type: date) | Map-reduce partition columns:jdate (type: date) | sort order:+ | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE | Filter Operator [FIL_14] | predicate:jdate is not null (type: boolean) | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE | TableScan [TS_1] | alias:b | Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE |<-Filter Operator [FIL_17] predicate:jdate is not null (type: boolean) Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL TableScan [TS_0] alias:a Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL Time taken: 1.459 seconds, Fetched: 45 row(s)
Created 01-18-2017 01:03 AM
- Not sure why dynamic partition pruning is not being chosen .. you might want to try running with DEBUG level logging and run the EXPLAIN statement again, that may give some clues as to what is going on during compiliation.
- As far as I can tell it looks like it tries to preserve the larger of the 2 tables for dynamic partition pruning.
- hive.explain.user just creates a more "user-friendly" explain, but sometimes the plan is shown a little differently so sometimes it helps to see it without that option.
Created 01-17-2017 08:13 PM
Could not paste both the explain plans in previous comment. Here is the explain plan by disabling hive.explain.user=false.
hive> set hive.explain.user=false; hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: A515595_20170117140547_4494cba3-581e-441c-8fb6-8175b74d89c2:3 Edges: Map 1 <- Map 2 (BROADCAST_EDGE) DagName: Vertices: Map 1 Map Operator Tree: TableScan alias: a filterExpr: jdate is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL Filter Operator predicate: jdate is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL Map Join Operator condition map: Inner Join 0 to 1 keys: 0 jdate (type: date) 1 jdate (type: date) outputColumnNames: _col0, _col1, _col6 input vertices: 1 Map 2 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE HybridGraceHashJoin: true Filter Operator predicate: (_col1 = _col6) (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: date) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Map 2 Map Operator Tree: TableScan alias: b filterExpr: jdate is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: jdate is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator key expressions: jdate (type: date) sort order: + Map-reduce partition columns: jdate (type: date) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 0.428 seconds, Fetched: 68 row(s)