Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

MERGE statement and partition pruning

MERGE statement and partition pruning

Expert Contributor

I have 2 tables, with the same structure:

CREATE TABLE IF NOT EXISTS src (
    -- DISTINCT field (+ partitions)
    id BIGINT

    -- other fields
  , email STRING
  , domain STRING
  , lang STRING
  , mobile_nr STRING

  , custom_fields STRING
  , groups array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>

  , ts_utc TIMESTAMP

  , sys_schema_version INT
  , sys_server_ipv4 BIGINT
  , sys_server_name STRING
  )
  PARTITIONED BY (
      license_name STRING
    , campaign_id INT
    )
  CLUSTERED BY (id)
  INTO 64 BUCKETS
  STORED AS ORC

One is a source table (basically recreated from scratch with new data for each merge, so needs to be fully reprocessed everytime) one is the destination table, which will grow.

Both tables have the same partitions and bucket definitions. When I EXPLAIN the MERGE statement, which has a join on the partitions and the bucketed field, I cannot see any partition pruning happening.

set hive.merge.cardinality.check=false;
set hive.tez.exec.print.summary=true;
set tez.user.explain=true;
explain MERGE INTO
-- default.2steps_false_64_1
vault.contact
dst
USING default.2steps_2steps_false_64_1 src
ON
        dst.license_name = src.license_name
    AND dst.campaign_id = src.campaign_id
    AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
    AND dst.updated_on_utc < src.ts_utc
THEN UPDATE SET
    -- other fields
    email =  src.email
  , city = src.city
  , lang =  src.lang
  , mobile_nr =  src.mobile_nr
  , custom_fields = src.custom_fields
  , groups = src.groups
  , updated_on_utc = src.ts_utc
  , sys_schema_version = src.sys_schema_version
  , sys_server_ipv4 = src.sys_server_ipv4
  , sys_server_name = src.sys_server_name
WHEN NOT MATCHED  THEN INSERT VALUES (
    src.id
  , src.email
  , src.city
  , src.lang
  , src.mobile_nr
  , src.custom_fields
  , src.groups
  , src.ts_utc
  , src.ts_utc
  , NULL -- deleted_on
  , src.sys_schema_version
  , src.sys_server_ipv4
  , src.sys_server_name
  , src.license_name
  , src.campaign_id
)
;
+-----------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                                                   Explain
+-----------------------------------------------------------------------------------------------------------------------------+--+
| Vertex dependency in root stage
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 5 (SIMPLE_EDGE)
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
| Reducer 4 <- Reducer 2 (SIMPLE_EDGE)
|
| Stage-5
|    Stats-Aggr Operator
|       Stage-0
|          Move Operator
|             partition:{}
|             table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|             Stage-3
|                Dependency Collection{}
|                   Stage-2
|                      Reducer 3
|                      File Output Operator [FS_904]
|                         compressed:true
|                         Statistics:Num rows: 496014 Data size: 166660704 Basic stats: COMPLETE Column stats: PARTIAL
|                         table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|                         Select Operator [SEL_901]
|                         |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
|                         |  Statistics:Num rows: 496014 Data size: 166660704 Basic stats: COMPLETE Column stats: PARTIAL
|                         |<-Reducer 2 [SIMPLE_EDGE]
|                            Reduce Output Operator [RS_900]
|                               key expressions:_col0 (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
|                               Map-reduce partition columns:UDFToInteger(_col0) (type: int)
|                               sort order:+
|                               Statistics:Num rows: 496014 Data size: 257927280 Basic stats: COMPLETE Column stats: PARTIAL
|                               value expressions:_col1 (type: bigint), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col8 (type: timestamp), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: int), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: bigint)
|                               Select Operator [SEL_899]
|                                  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
|                                  Statistics:Num rows: 496014 Data size: 257927280 Basic stats: COMPLETE Column stats: PARTIAL
|                                  Filter Operator [FIL_906]
|                                     predicate:((_col13 = _col29) and (_col14 = _col30) and (_col0 = _col18) and (_col7 < _col25)) (type: boolean)
|                                     Statistics:Num rows: 496014 Data size: 226182384 Basic stats: COMPLETE Column stats: PARTIAL
|                                     Merge Join Operator [MERGEJOIN_916]
|                                     |  condition map:[{"":"Right Outer Join0 to 1"}]
|                                     |  keys:{"0":"license_name (type: string), campaign_id (type: bigint), id (type: bigint)","1":"license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)"}
|                                     |  outputColumnNames:["_col0","_col7","_col8","_col9","_col13","_col14","_col17","_col18","_col19","_col20","_col21","_col22","_col23","_col24","_col25","_col26","_col27","_col28","_col29","_col30"]
|                                     |  Statistics:Num rows: 11904348 Data size: 25284835152 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |<-Map 1 [SIMPLE_EDGE]
|                                     |  Reduce Output Operator [RS_889]
|                                     |     key expressions:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
|                                     |     Map-reduce partition columns:license_name (type: string), campaign_id(type: bigint), id (type: bigint)
|                                     |     sort order:+++
|                                     |     Statistics:Num rows: 129102910 Data size: 16525280556 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |     value expressions:updated_on_utc (type: timestamp), created_on_utc (type: timestamp), deleted_on_utc (type: timestamp), ROW__ID (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
|                                     |     TableScan [TS_887]
|                                     |        ACID table:true
|                                     |        alias:dst
|                                     |        Statistics:Num rows: 129102910 Data size: 16525280556 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |<-Map 5 [SIMPLE_EDGE]
|                                        Reduce Output Operator [RS_890]
|                                           key expressions:license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)
|                                           Map-reduce partition columns:license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)
|                                           sort order:+++
|                                           Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
|                                           value expressions:email (type: string), city (type: string), lang (type: string), mobile_nr (type: string), custom_fields (type: string), groups (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), ts_utc (type: timestamp), sys_schema_version (type: int), sys_server_ipv4 (type: bigint), sys_server_name (type: string), campaign_id (type: int)
|                                           TableScan [TS_888]
|                                              alias:src
|                                              Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
|                      Reducer 4
|                      File Output Operator [FS_897]
|                         compressed:true
|                         Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                         table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|                         Select Operator [SEL_895]
|                         |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
|                         |  Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                         |<-Reducer 2 [SIMPLE_EDGE]
|                            Reduce Output Operator [RS_894]
|                               Map-reduce partition columns:_col0 (type: bigint)
|                               sort order:
|                               Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                               value expressions:_col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col7 (type: timestamp), _col10 (type: int), _col11 (type: bigint), _col12 (type: string), _col13 (type: string), _col14 (type: int)
|                               Select Operator [SEL_893]
|                                  outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col14","_col2","_col3","_col4","_col5","_col6","_col7"]
|                                  Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                                  Filter Operator [FIL_907]
|                                     predicate:(_col13 is null and _col14 is null and _col0 is null) (type: boolean)
|                                     Statistics:Num rows: 1 Data size: 456 Basic stats: COMPLETE Column stats: PARTIAL
|                                      Please refer to the previous Merge Join Operator [MERGEJOIN_916]
| Stage-4
|    Stats-Aggr Operator
|       Stage-1
|          Move Operator
|             partition:{}
|             table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|              Please refer to the previous Stage-3
|


Other explain with hive.explain.user=true

0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.merge.cardinality.check=false;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- set hive.tez.dynamic.partition.pruning=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- set hive.tez.dynamic.partition.pruning.max.data.size=107374182400; -- 100GB
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.tez.exec.print.summary=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set tez.user.explain=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> set hive.explain.user=true;
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> explain MERGE INTO
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- default.2steps_false_64_1
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> vault.contact
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> dst
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> USING default.2steps_2steps_false_64_1 src
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> ON
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>         dst.license_name = src.license_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     AND dst.campaign_id = src.campaign_id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     AND dst.id = src.id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     AND dst.license_name = 'baarn'
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> -- On match: keep latest loaded
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> WHEN MATCHED
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     AND dst.updated_on_utc < src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> THEN UPDATE SET
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     -- other fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     email =  src.email
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , domain = src.domain
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , lang =  src.lang
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , mobile_nr =  src.mobile_nr
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , custom_fields = src.custom_fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , groups = src.groups
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , updated_on_utc = src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , sys_schema_version = src.sys_schema_version
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , sys_server_ipv4 = src.sys_server_ipv4
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , sys_server_name = src.sys_server_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> WHEN NOT MATCHED  THEN INSERT VALUES (
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>     src.id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.email
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.domain
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.lang
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.mobile_nr
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.custom_fields
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.groups
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.ts_utc
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , NULL -- deleted_on
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.sys_schema_version
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.sys_server_ipv4
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.sys_server_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.license_name
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>   , src.campaign_id
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> )
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput> ;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                                       Explain
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Vertex dependency in root stage
| Map 2 <- Map 1 (BROADCAST_EDGE)
| Reducer 3 <- Map 2 (SIMPLE_EDGE)
| Reducer 4 <- Map 2 (SIMPLE_EDGE)
|
| Stage-5
|    Stats-Aggr Operator
|       Stage-0
|          Move Operator
|             partition:{}
|             table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|             Stage-3
|                Dependency Collection{}
|                   Stage-2
|                      Reducer 3
|                      File Output Operator [FS_1461]
|                         compressed:true
|                         Statistics:Num rows: 496014 Data size: 123507486 Basic stats: COMPLETE Column stats: PARTIAL
|                         table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|                         Select Operator [SEL_1458]
|                         |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15"]
|                         |  Statistics:Num rows: 496014 Data size: 123507486 Basic stats: COMPLETE Column stats: PARTIAL
|                         |<-Map 2 [SIMPLE_EDGE]
|                            Reduce Output Operator [RS_1457]
|                               key expressions:_col0 (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
|                               Map-reduce partition columns:UDFToInteger(_col0) (type: int)
|                               sort order:+
|                               Statistics:Num rows: 496014 Data size: 79362240 Basic stats: COMPLETE Column stats: PARTIAL
|                               value expressions:_col1 (type: bigint), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col8 (type: timestamp), _col9 (type: timestamp), _col10 (type: timestamp), _col11 (type: int), _col12 (type: bigint), _col13 (type: string), _col15 (type: bigint)
|                               Select Operator [SEL_1456]
|                                  outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col15","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
|                                  Statistics:Num rows: 496014 Data size: 79362240 Basic stats: COMPLETE Column stats: PARTIAL
|                                  Filter Operator [FIL_1463]
|                                     predicate:((_col13 = 'baarn') and (_col13 = _col29) and (_col14 = _col30) and (_col0 = _col18) and (_col7 < _col25)) (type: boolean)
|                                     Statistics:Num rows: 496014 Data size: 179061054 Basic stats: COMPLETE Column stats: PARTIAL
|                                     Map Join Operator [MAPJOIN_1474]
|                                     |  condition map:[{"":"Right Outer Join0 to 1"}]
|                                     |  HybridGraceHashJoin:true
|                                     |  keys:{"Map 2":"license_name (type: string), UDFToLong(campaign_id) (type: bigint), id (type: bigint)","Map 1":"license_name (type: string), campaign_id (type: bigint), id (type: bigint)"}
|                                     |  outputColumnNames:["_col0","_col7","_col8","_col9","_col13","_col14","_col17","_col18","_col19","_col20","_col21","_col22","_col23","_col24","_col25","_col26","_col27","_col28","_col29","_col30"]
|                                     |  Statistics:Num rows: 11904348 Data size: 24153922092 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |<-Map 1 [BROADCAST_EDGE]
|                                     |  Reduce Output Operator [RS_1446]
|                                     |     key expressions:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
|                                     |     Map-reduce partition columns:license_name (type: string), campaign_id (type: bigint), id (type: bigint)
|                                     |     sort order:+++
|                                     |     Statistics:Num rows: 621448 Data size: 79546063 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |     value expressions:updated_on_utc (type: timestamp), created_on_utc (type: timestamp), deleted_on_utc (type: timestamp), ROW__ID (type: struct<transactionid:bigint,bucketid:int,rowid:bigint>)
|                                     |     TableScan [TS_1443]
|                                     |        ACID table:true
|                                     |        alias:dst
|                                     |        Statistics:Num rows: 621448 Data size: 79546063 Basic stats: COMPLETE Column stats: PARTIAL
|                                     |<-TableScan [TS_1444]
|                                           alias:src
|                                           Statistics:Num rows: 11904348 Data size: 29935728348 Basic stats: COMPLETE Column stats: PARTIAL
|                            Reduce Output Operator [RS_1451]
|                               Map-reduce partition columns:_col0 (type: bigint)
|                               sort order:
|                               Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                               value expressions:_col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: array<struct<group_id:bigint,campaign_id:bigint,member_since_ts_utc:bigint>>), _col7 (type: timestamp), _col10 (type: int), _col11 (type: bigint), _col12 (type: string), _col13 (type: string), _col14 (type: int)
|                               Select Operator [SEL_1450]
|                                  outputColumnNames:["_col0","_col1","_col10","_col11","_col12","_col13","_col14","_col2","_col3","_col4","_col5","_col6","_col7"]
|                                  Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                                  Filter Operator [FIL_1464]
|                                     predicate:(_col13 is null and _col14 is null and _col0 is null) (type: boolean)
|                                     Statistics:Num rows: 1 Data size: 361 Basic stats: COMPLETE Column stats: PARTIAL
|                                      Please refer to the previous Map Join Operator [MAPJOIN_1474]
|                      Reducer 4
|                      File Output Operator [FS_1454]
|                         compressed:true
|                         Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                         table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|                         Select Operator [SEL_1452]
|                         |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
|                         |  Statistics:Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
|                         |<- Please refer to the previous Map 2 [SIMPLE_EDGE]
| Stage-4
|    Stats-Aggr Operator
|       Stage-1
|          Move Operator
|             partition:{}
|             table:{"name:":"vault.contact","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
|              Please refer to the previous Stage-3
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
0: jdbc:hive2://ip-10-0-0-21.eu-west-1.comput>

Could anybody confirm or infirm from this explain that the partitions are properly pruned?

hdp 2.6, small (4 nodes) AWS cluster

6 REPLIES 6

Re: MERGE statement and partition pruning

Expert Contributor

I think for partition pruning to work you have to either something like WHERE license_name=X somewhere for static pruning or src.license_name=target.license_name for dynamic partition pruning. Otherwise there is nothing to to infer a smaller partition set form.

Re: MERGE statement and partition pruning

Expert Contributor

Thanks @Eugene Koifman. You are of course right, the issue is that I updated the MERGE text without reflecting the edits on the CREATE (now fixed in the questions). I am indeed using the partitions in the MERGE:

...
ON dst.license_name = src.license_name AND dst.campaign_id = src.campaign_id
...

but as far as I can tell, the pruning does not happen.

Re: MERGE statement and partition pruning

Expert Contributor

could you also post the plan with "hive.explain.user=true" please

Re: MERGE statement and partition pruning

Expert Contributor

I added this explain.

Re: MERGE statement and partition pruning

Cloudera Employee

Is dynamic partition pruning enabled?

set hive.tez.dynamic.partition.pruning=true;

Based on the dynamic partition pruning system tests - maybe also make sure the following settings are also enabled?

set hive.optimize.ppd=true;

set hive.optimize.index.filter=true;

Re: MERGE statement and partition pruning

Expert Contributor
  • hive.tez.dynamic.partition.pruning already globally true
  • hive.optimize.ppd true by default, I explicitly set it to true
  • hive.optimize.index.filter false by default, I set it to true
  • I set hive.tez.bucket.pruning to true as well.

I think that my issue is related to https://community.hortonworks.com/questions/142167/why-not-set-hivetezdynamicpartitionpruningmaxdata...

Thanks for your help!

Don't have an account?
Coming from Hortonworks? Activate your account here