Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Question about Partition Pruning in Impala

avatar
Expert Contributor

CDH 5.2.0 and 5.10.1

 

It seems partition pruning is not working when using "or" and conditions with both partitioned and un-partitioned columns.


Creates partitioned table with 3 partitions, 1 file for each partition.

 

create table part_table (
strcol1 string,
intcol1 int
)
PARTITIONED BY (
  keycol STRING
);

insert into part_table partition (keycol='k1') values ('str1', 1);
insert into part_table partition (keycol='k2') values ('str2', 2);
insert into part_table partition (keycol='k3') values ('str3', 3);

 

Partition pruning worked normally when no un-partitioned columns used 

explain select count(*) 
from part_table
where (
(keycol='k1')
or (keycol='k2' )
)

+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| PLAN-ROOT SINK                                      |
| |                                                   |
| 01:AGGREGATE [FINALIZE]                             |
| |  output: count(*)                                 |
| |                                                   |
| 00:SCAN HDFS [cdragg.part_table]                    |
|    partitions=2/3 files=2 size=14B                  |
+-----------------------------------------------------+

 

When un-partitioned columns used in conditions, no partition pruning performed.

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=3/3 files=3 size=21B                                                  |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

I can workaround the situation by another condition on partitioned columns. Is this the proper way to do?

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
and keycol in ('k0', 'k2')

+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=1/3 files=1 size=7B                                                   |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

 

1 ACCEPTED SOLUTION

avatar

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!

View solution in original post

1 REPLY 1

avatar

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!