Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Question about Partition Pruning in Impala

SOLVED Go to solution

Question about Partition Pruning in Impala

Rising Star

CDH 5.2.0 and 5.10.1

 

It seems partition pruning is not working when using "or" and conditions with both partitioned and un-partitioned columns.


Creates partitioned table with 3 partitions, 1 file for each partition.

 

create table part_table (
strcol1 string,
intcol1 int
)
PARTITIONED BY (
  keycol STRING
);

insert into part_table partition (keycol='k1') values ('str1', 1);
insert into part_table partition (keycol='k2') values ('str2', 2);
insert into part_table partition (keycol='k3') values ('str3', 3);

 

Partition pruning worked normally when no un-partitioned columns used 

explain select count(*) 
from part_table
where (
(keycol='k1')
or (keycol='k2' )
)

+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| PLAN-ROOT SINK                                      |
| |                                                   |
| 01:AGGREGATE [FINALIZE]                             |
| |  output: count(*)                                 |
| |                                                   |
| 00:SCAN HDFS [cdragg.part_table]                    |
|    partitions=2/3 files=2 size=14B                  |
+-----------------------------------------------------+

 

When un-partitioned columns used in conditions, no partition pruning performed.

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=3/3 files=3 size=21B                                                  |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

I can workaround the situation by another condition on partitioned columns. Is this the proper way to do?

explain select count(*)
from part_table
where (
(keycol='k0' and intcol1=0)
or (keycol='k2' and intcol1=2)
)
and keycol in ('k0', 'k2')

+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:AGGREGATE [FINALIZE]                                                             |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [cdragg.part_table]                                                    |
|    partitions=1/3 files=1 size=7B                                                   |
|    predicates: ((keycol = 'k0' AND intcol1 = 0) OR (keycol = 'k2' AND intcol1 = 2)) |
+-------------------------------------------------------------------------------------+

 

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Question about Partition Pruning in Impala

Master Collaborator

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!

1 REPLY 1
Highlighted

Re: Question about Partition Pruning in Impala

Master Collaborator

You are correct, this is a limitation:

https://issues.cloudera.org/browse/IMPALA-2108

 

Impala would basically need to do infer the IN predicate that you are using in your workaround.

 

You are welcome to take a stab at contributing a patch!