Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Question about Partition Pruning in Impala

avatar
Expert Contributor

CDH 5.2.0 and 5.10.1

 

It seems partition pruning is not working when using "or" and conditions with both partitioned and un-partitioned columns.


Creates partitioned table with 3 partitions, 1 file for each partition.

 

create table part_table (
strcol1 string,
intcol1 int
)
PARTITIONED BY (
  keycol STRING
);

insert into part_table partition (keycol='k1') values ('str1', 1);
insert into part_table partition (keycol='k2') values ('str2', 2);
insert into part_table partition (keycol='k3') values ('str3', 3);

 

Partition pruning worked normally when no un-partitioned columns used 

explain select count(*) 
from part_table
where (
(keycol='k1')
or (keycol='k2' )
)

+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| PLAN-ROOT SINK                                      |
| |                                                   |
| 01:AGGREGATE [FINALIZE]                             |
| |  output: count(*)                                 |
| |                                                   |
| 00:SCAN HDFS [cdragg.part_table]                    |
|    partitions=2/3 files=2 size=14B                  |
+-----------------------------------------------------+

 

When un-partitioned columns used in conditions, no partition pruning performed.

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=3/3 files=3 size=21B                                                  |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

I can workaround the situation by another condition on partitioned columns. Is this the proper way to do?

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
and keycol in ('k0', 'k2')

+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=1/3 files=1 size=7B                                                   |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

 

1 ACCEPTED SOLUTION

avatar

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!

View solution in original post

1 REPLY 1

avatar

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!