Created on 01-17-2017 12:49 AM - edited 09-16-2022 03:55 AM
Hey all,
I'm using CDH 5.9, Impala 2.7.
I'm examining the runtime-filter feature, but it's not working as I expected.
Here's an example of my case:
-- Table 1 - Partitioned by day
CREATE TABLE adb.test_prt (ID string)
partitioned BY (day INT);
INSERT INTO adb.test_prt PARTITION(day)
SELECT 'a' AS ID ,20170102 AS day
UNION all
SELECT 'b' AS ID ,20170102 AS day
UNION ALL
SELECT 'c' AS ID,20170102 AS day
UNION ALL
SELECT 'd' AS ID ,20170103 AS day
UNION ALL
SELECT 'd' AS ID ,20170105 AS day
UNION all
SELECT 'g' AS ID ,20170105 AS day
SELECT * FROM adb.test_prt
show partitions adb.test_prt
-- Table 2 - raw data
CREATE TABLE adb.test_1 (day INT)
INSERT INTO adb.test_1
SELECT 20170102 AS day
SELECT * FROM adb.test_1
--###################################
-- explain 1
explain
SELECT *
FROM adb.test_prt t
WHERE t.day IN (SELECT day FROM adb.test_1)
output : partitions=3/3 files=3 size=12B
-- explain 2
explain
SELECT *
FROM adb.test_prt t
WHERE t.day IN (20170102)
output: partitions=1/3 files=1 size=6B
I don't understand why there is a difference between the outputs. Table adb.test_1 has only one value which match to specific partition in adb.test_prt. I'm expecting from the runtime filter to figure this out.
What am I missing?
Another question: Is this feature support joins as well, rather then where clause?
Here's an example
explain
SELECT *
FROM adb.test_prt t
inner JOIN adb.test_1 a
ON t.day=a.day
output: partitions=3/3 files=3 size=12B
runtime filters: RF000 -> t.day
Thanks!
Dror
Created 01-17-2017 10:14 AM
Created 01-17-2017 10:14 AM
Created 01-17-2017 11:18 PM
Hey,
Thanks for the quick response.
How can I check the actual partitions number that been read?
I read the profile output for this query, but there is to much information there.
Here's the output for the explain statement for EXPALIN_1 step.
It seems like the runtime filter enbaled.
Estimated Per-Host Requirements: Memory=2.03GB VCores=2
WARNING: The following tables are missing relevant table and/or column statistics.
adb.test_1, adb.test_prt
04:EXCHANGE [UNPARTITIONED]
|
02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]
| hash predicates: t.day = day
| runtime filters: RF000 <- day
|
|--03:EXCHANGE [BROADCAST]
| |
| 01:SCAN HDFS [adb.test_1]
| partitions=1/1 files=1 size=9B
|
00:SCAN HDFS [adb.test_prt t]
partitions=3/3 files=3 size=12B
runtime filters: RF000 -> t.day
Thanks!
Created 01-18-2017 12:06 PM
Created 01-19-2017 01:50 AM
Many thanks Henry!
Last question - When using joins, does runtime filter works on INNER JOIN solely?
When I checked the profile output after performing LEFT JOIN I didn't see any mention for runtime filter.
Thanks
Created 01-19-2017 09:39 AM
Created 03-20-2017 04:18 PM
Hi Henry,
I have a question for you and it is about `partition pruning` ( about pruning )
Let's say there are two tables A and B.
And, each table is partitioned by yearweek.
And, here is the query I'd like to run. ( Yes. I need to use left join to get result what I want )
SELECT * FROM A LEFT OUTER JOIN B ON A.account_id = B.account_id AND A.yearweek = 201710 and B.yearweek = 201710
Even this doesn't select specific partition in `table A`
SELECT * FROM A LEFT OUTER JOIN B ON A.account_id = B.account_id AND B.yearweek = 201710 WHERE A.yearweek = 201710
Like you said, `A.yearweek` = 201710 on `On clause` couldn't select partition yearweek=201710.
This might be filter is applied from `Left to Right`.
In order to select specific partition for `table A`. I used `dynamic partition` and updated query like this.
SELECT * FROM (SELECT * FROM A WHERE yearweek = 201710) a LEFT OUTER JOIN B b ON a.account_id = b.account_id AND b.yearweek = 201710
Do you think this is best I can do?
Or is there way to limit data for `table A` by using `On clause`?
And, is there any refrerence you would recommend for me to upderstand how JOIN works in Impala and Hive?
Thank you very much in advance.
Gatsby
Created 03-20-2017 09:19 PM
Your second query variant should select the partition in table A. If that's not the case, something seems wrong. Could you provide a profile that shows the lack of partitioning pruning with the second query in your use case?
The the first and second query variant have a different meaning, and hence different optimizations apply. The ON and WHERE clauses have very specific meanings in SQL, in particular with outer joins. The ON clause affects which rows are considered a "match" for he purpose of the outer join, so if you put "A.yearweek = 201710" in the ON clause, then those rows not satisfying that condition are considered a join non-match. The meaning of a LEFT OUTER JOIN is that the left side rows are returned even for join non-matches (with NULLs on the right side).
The WHERE clause is logically applied *after* the FROM clause, so all rows produced by the FROM clause are filtered (including non-matches of the outer join, so we can move a WHERE-clause predicate on A into the scan in your second query variant).
Created 03-20-2017 11:09 PM
Alex,
First of all, thank you very much for your explanation.
You're right. the second query selects partition in table A.
And, I'm fully aware of the difference between the first one( using on clause ) and second on ( where clause ) like the way you explained.
The reason different variances were tried to find out ways to limit table A data before joinging two tables.
( yes, second query doesn't work this way )
In MySQL, table A could be limited by `ON clause`, but with Impala, I don't know how to do it.
Do you think using subquery is the best way?
Thank you
Gatsby
Created 03-21-2017 12:00 AM
Hi Gatsby,
if your goal is to limit the rows of table A then I think the subquery is the safest bet. Your second variant with the WHERE clause also seems fine, but depending on what's in the FROM clause that predicate may not always be applied at the scan of A. So going with a subquery seems the most straightforward.
Runtime filters from right-to-left on a LEFT OUTER JOIN are not possible because restrictoins on the right side cannot be directly applied to the left side.
I would be surprised if MySQL had a different behavior with respect to the ON clause. Do you have an example of MySQL results being different from Impala's?
Alex