Created on 08-24-2017 03:53 PM - edited 09-16-2022 05:08 AM
I have a table with two columns test, id and yyyymmdd and group_id are my partition columns.
When I run following query it runs fast as it only scans 1 partition.
select id, yyyymmdd, group_id, test from dwh.table where
(id='1a' and yyyymmdd=20170815 and group_id=1)
But when I tried to run following . It scans entire table. Even explain shows that it is going to perform full table scan. I think I'm missing somethign simple here. Appreciate community's help to find out what I'm missing here ?
select id, yyyymmdd, group_id, test from dwh.table where
(id='1a' and yyyymmdd=20170815 and group_id=1) OR
(id='2b' and yyyymmdd=20170811 and group_id=2)
How to scan two rows in two different partitions ?
Created 08-24-2017 04:01 PM
I'm afraid Impala is not yet able to recognize that only two partitions need to be scanned. We're aware of the gap and that specific optimization is tracked by:
https://issues.apache.org/jira/browse/IMPALA-2108
For now, you can manually rewrite your query as suggested in the JIRA as follows:
select id, yyyymmdd, group_id, test from dwh.table where
((id='1a' and yyyymmdd=20170815 and group_id=1) OR (id='2b' and yyyymmdd=20170811 and group_id=2))
AND
((yyyymmdd=20170811 and group_id=2) OR (yyyymmdd=20170815 and group_id=1))
or alternatively, use a union:
select id, yyyymmdd, group_id, test from dwh.table where
id='1a' and yyyymmdd=20170815 and group_id=1
union all
select id, yyyymmdd, group_id, test from dwh.table where
id='2b' and yyyymmdd=20170811 and group_id=2
Created 08-24-2017 04:01 PM
I'm afraid Impala is not yet able to recognize that only two partitions need to be scanned. We're aware of the gap and that specific optimization is tracked by:
https://issues.apache.org/jira/browse/IMPALA-2108
For now, you can manually rewrite your query as suggested in the JIRA as follows:
select id, yyyymmdd, group_id, test from dwh.table where
((id='1a' and yyyymmdd=20170815 and group_id=1) OR (id='2b' and yyyymmdd=20170811 and group_id=2))
AND
((yyyymmdd=20170811 and group_id=2) OR (yyyymmdd=20170815 and group_id=1))
or alternatively, use a union:
select id, yyyymmdd, group_id, test from dwh.table where
id='1a' and yyyymmdd=20170815 and group_id=1
union all
select id, yyyymmdd, group_id, test from dwh.table where
id='2b' and yyyymmdd=20170811 and group_id=2