Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala query to scan two records in different partitions

avatar
Contributor

I have a table with two columns test, id and yyyymmdd and group_id are my partition columns.

 

When I run following query it runs fast as it only scans 1 partition.

 

select id, yyyymmdd, group_id, test from dwh.table where
(id='1a' and yyyymmdd=20170815 and group_id=1)

 

But when I tried to run following . It scans entire table. Even explain shows that it is going to perform full table scan. I think I'm missing somethign simple here. Appreciate community's help to find out what I'm missing here ?

 

select id, yyyymmdd, group_id, test from dwh.table where
(id='1a' and yyyymmdd=20170815 and group_id=1) OR
(id='2b' and yyyymmdd=20170811 and group_id=2)

 

How to scan two rows in two different partitions ?

 

 

 

1 ACCEPTED SOLUTION

avatar

I'm afraid Impala is not yet able to recognize that only two partitions need to be scanned. We're aware of the gap and that specific optimization is tracked by:

https://issues.apache.org/jira/browse/IMPALA-2108

 

For now, you can manually rewrite your query as suggested in the JIRA as follows:

select id, yyyymmdd, group_id, test from dwh.table where

((id='1a' and yyyymmdd=20170815 and group_id=1) OR (id='2b' and yyyymmdd=20170811 and group_id=2))

AND

((yyyymmdd=20170811 and group_id=2) OR (yyyymmdd=20170815 and group_id=1))

 

or alternatively, use a union:

 

select id, yyyymmdd, group_id, test from dwh.table where
id='1a' and yyyymmdd=20170815 and group_id=1

union all

select id, yyyymmdd, group_id, test from dwh.table where

id='2b' and yyyymmdd=20170811 and group_id=2

 

View solution in original post

1 REPLY 1

avatar

I'm afraid Impala is not yet able to recognize that only two partitions need to be scanned. We're aware of the gap and that specific optimization is tracked by:

https://issues.apache.org/jira/browse/IMPALA-2108

 

For now, you can manually rewrite your query as suggested in the JIRA as follows:

select id, yyyymmdd, group_id, test from dwh.table where

((id='1a' and yyyymmdd=20170815 and group_id=1) OR (id='2b' and yyyymmdd=20170811 and group_id=2))

AND

((yyyymmdd=20170811 and group_id=2) OR (yyyymmdd=20170815 and group_id=1))

 

or alternatively, use a union:

 

select id, yyyymmdd, group_id, test from dwh.table where
id='1a' and yyyymmdd=20170815 and group_id=1

union all

select id, yyyymmdd, group_id, test from dwh.table where

id='2b' and yyyymmdd=20170811 and group_id=2