Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Question on hive dynamic partition pruning and explain plan

Question on hive dynamic partition pruning and explain plan

Expert Contributor

Hi,

Could someone please explain me understand the below questions on hive partition pruning and explain plan?

1. How to check if partition pruning occurs by checking the explain plan? I thought I would see "Dynamic partitioning event operator" in explain plan, but in my sample query below I am not seeing any such operator. I enabled hive.tez.dynamic.partition.pruning.

-- Since the table does not have much data, it is going for map join, does that have anything to do with partition pruning not happening?

explain select a.* from big_part a, small_np b where a.jdate = b.jdate ;

big_part is partitioned on jdate where small_np is a non partitioned table, even adding explicit filter on jdate like jdate = "2017-01-01" is not showing this operator in explain plan.
The tables are just in text formats.

I tried disabling and enabling hive.optimize.ppd . But it just changed adding or removing a filter operator much higher in explain plan, but no difference besides that. Will the optimize.ppd parameter have any effect on partition pruning?

2. Is it correct to expect that dynamic partition pruning should happen on big_part table in the above query?

3. If both the tables used in join are partitioned, can we expect that dynamic partition pruning happens on both tables?

4. Will the dynamic partition pruning occur in case of outer joins too? (full and left outer assuming that inner table's conditions are given in "on condition" or outer table's conditions are given in "where clause").

5. What exactly this hive.optimize.ppd do in case of text files? Just push the filter predicates when reading from table itself if possible?

Thank you!

4 REPLIES 4

Re: Question on hive dynamic partition pruning and explain plan

Cloudera Employee

1. You are correct that you should see something like "Dynamic partitioning event operator" in the explain plan. From some of the examples in the Hive tests dynamic partition pruning can be used even in the case of mapjoin getting selected.

2. It does look like dynamic partition pruning should occur on those tables .. What version of Hive? Can you show the explain plan and DESCRIBE TABLE on the tables involved? Can you try setting hive.explain.user=false?

3. If both tables are partitioned tables, dynamic partition pruning should only happen on one side

4. Not totally sure about the outer join case - might only work in the case that the non-partitioned table was the outer.

5. I don't think text/non-text matters for hive.optimize.ppd - I believe this has to do with predicate pushdown by the optimizer. If you're talking about pushdown of predicates into the storage level (like for ORC), it looks like that setting is controlled by hive.optimize.index.filter.

Re: Question on hive dynamic partition pruning and explain plan

Expert Contributor

Thanks for your comments!

Here is the explain plan and create table statements. Hive version is 0.14.

And also for the 3rd answer, in case both are partitioned tables, is there anyway to ensure that bigger of the two tables undergo partition pruning instead of the small one? (or is this the default behavior?)

What does hive.explain.user = false do? I have attached explain plan by both enabling and disabling this.

   > show create table big_part;
OK
CREATE TABLE `big_part`(
  `id` int)
PARTITIONED BY ( 
  `jdate` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://littleredns/apps/hive/warehouse/big_part'
TBLPROPERTIES (
  'transient_lastDdlTime'='1484615054')
Time taken: 1.749 seconds, Fetched: 14 row(s)
hive> show create table small_np;
OK
CREATE TABLE `small_np`(
  `id2` int, 
  `jdate` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://littleredns/apps/hive/warehouse/small_np'
TBLPROPERTIES (
  'transient_lastDdlTime'='1484615162')
Time taken: 0.16 seconds, Fetched: 13 row(s)
hive> set hive.optimize.ppd=true;
hive> set hive.tez.dynamic.partition.pruning=true;
hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
OK
Plan not optimized by CBO.
Vertex dependency in root stage
Map 1 <- Map 2 (BROADCAST_EDGE)
Stage-0
  Fetch Operator
      limit:-1
      Stage-1
        Map 1 vectorized
        File Output Operator [FS_21]
            compressed:false
            Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [OP_20]
              outputColumnNames:["_col0","_col1"]
              Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Filter Operator [FIL_19]
                  predicate:(_col1 = _col6) (type: boolean)
                  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Map Join Operator [MAPJOIN_18]
                  |  condition map:[{"":"Inner Join 0 to 1"}]
                  |  HybridGraceHashJoin:true
                  |  keys:{"Map 2":"jdate (type: date)","Map 1":"jdate (type: date)"}
                  |  outputColumnNames:["_col0","_col1","_col6"]
                  |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  |<-Map 2 [BROADCAST_EDGE]
                  |  Reduce Output Operator [RS_4]
                  |     key expressions:jdate (type: date)
                  |     Map-reduce partition columns:jdate (type: date)
                  |     sort order:+
                  |     Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  |     Filter Operator [FIL_14]
                  |        predicate:jdate is not null (type: boolean)
                  |        Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  |        TableScan [TS_1]
                  |           alias:b
                  |           Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  |<-Filter Operator [FIL_17]
                        predicate:jdate is not null (type: boolean)
                        Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
                        TableScan [TS_0]


                          alias:a
                          Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Time taken: 1.459 seconds, Fetched: 45 row(s)

Re: Question on hive dynamic partition pruning and explain plan

Cloudera Employee

- Not sure why dynamic partition pruning is not being chosen .. you might want to try running with DEBUG level logging and run the EXPLAIN statement again, that may give some clues as to what is going on during compiliation.

- As far as I can tell it looks like it tries to preserve the larger of the 2 tables for dynamic partition pruning.

- hive.explain.user just creates a more "user-friendly" explain, but sometimes the plan is shown a little differently so sometimes it helps to see it without that option.

Highlighted

Re: Question on hive dynamic partition pruning and explain plan

Expert Contributor

Could not paste both the explain plans in previous comment. Here is the explain plan by disabling hive.explain.user=false.

hive> set hive.explain.user=false;

hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1




STAGE PLANS:

  Stage: Stage-1

    Tez

      DagId: A515595_20170117140547_4494cba3-581e-441c-8fb6-8175b74d89c2:3

      Edges:

        Map 1 <- Map 2 (BROADCAST_EDGE)

      DagName: 

      Vertices:

        Map 1 

            Map Operator Tree:

                TableScan

                  alias: a

                  filterExpr: jdate is not null (type: boolean)

                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL

                  Filter Operator

                    predicate: jdate is not null (type: boolean)

                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL

                    Map Join Operator

                      condition map:

                           Inner Join 0 to 1

                      keys:

                        0 jdate (type: date)

                        1 jdate (type: date)

                      outputColumnNames: _col0, _col1, _col6

                      input vertices:

                        1 Map 2

                      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                      HybridGraceHashJoin: true

                      Filter Operator

                        predicate: (_col1 = _col6) (type: boolean)

                        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                        Select Operator

                          expressions: _col0 (type: int), _col1 (type: date)

                          outputColumnNames: _col0, _col1

                          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                          File Output Operator

                            compressed: false

                            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                            table:

                                input format: org.apache.hadoop.mapred.TextInputFormat

                                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

            Execution mode: vectorized

        Map 2 

            Map Operator Tree:

                TableScan

                  alias: b

                  filterExpr: jdate is not null (type: boolean)

                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                  Filter Operator

                    predicate: jdate is not null (type: boolean)

                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE

                    Reduce Output Operator

                      key expressions: jdate (type: date)

                      sort order: +

                      Map-reduce partition columns: jdate (type: date)

                      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE




  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink




Time taken: 0.428 seconds, Fetched: 68 row(s)