Support Questions

Find answers, ask questions, and share your expertise

Impala runtime filter not working as expected

avatar
Explorer

Hey all,

I'm using CDH 5.9, Impala 2.7.

I'm examining the runtime-filter feature, but it's not working as I expected.

 

Here's an example of my case:

 

-- Table 1 - Partitioned by day
CREATE TABLE adb.test_prt (ID string)
partitioned BY (day INT);

INSERT INTO adb.test_prt PARTITION(day)
SELECT 'a' AS ID ,20170102 AS day
UNION all
SELECT 'b' AS ID ,20170102 AS day
UNION ALL
SELECT 'c' AS ID,20170102 AS day
UNION ALL
SELECT 'd' AS ID ,20170103 AS day
UNION ALL
SELECT 'd' AS ID ,20170105 AS day
UNION all
SELECT 'g' AS ID ,20170105 AS day


SELECT * FROM adb.test_prt
show partitions adb.test_prt

-- Table 2 - raw data
CREATE TABLE adb.test_1 (day INT)

INSERT INTO adb.test_1
SELECT 20170102 AS day

SELECT * FROM adb.test_1

 

--###################################
-- explain 1
explain
SELECT *
FROM adb.test_prt t
WHERE t.day IN (SELECT day FROM adb.test_1)

 

output : partitions=3/3 files=3 size=12B

 

-- explain 2

explain
SELECT *
FROM adb.test_prt t
WHERE t.day IN (20170102)

 

output:  partitions=1/3 files=1 size=6B

 

I don't understand why there is a difference between the outputs. Table adb.test_1 has only one value which match to specific partition in adb.test_prt. I'm expecting from the runtime filter to figure this out.

What am I missing?

 

Another question: Is this feature support joins as well, rather then where clause?

Here's an example

 

explain
SELECT *
FROM adb.test_prt t
inner JOIN adb.test_1 a
ON t.day=a.day

 

output:  partitions=3/3 files=3 size=12B
runtime filters: RF000 -> t.day

 

Thanks!

Dror

1 ACCEPTED SOLUTION

avatar
Expert Contributor
Can you provide the whole of your 'explain 1'? I would expect to see
runtime filtering enabled in the plan.

The number of partitions scanned ('partitions=3/3') is correct. Runtime
filtering happens at, well, runtime, so the planner doesn't know how many
partitions it's going to skip. In your second example, the planner can
figure this out ahead of time.

Henry

View solution in original post

10 REPLIES 10

avatar
Expert Contributor
Can you provide the whole of your 'explain 1'? I would expect to see
runtime filtering enabled in the plan.

The number of partitions scanned ('partitions=3/3') is correct. Runtime
filtering happens at, well, runtime, so the planner doesn't know how many
partitions it's going to skip. In your second example, the planner can
figure this out ahead of time.

Henry

avatar
Explorer

Hey,

Thanks for the quick response.
How can I check the actual partitions number that been read?

 

I read the profile output for this query, but there is to much information there.

 

Here's the output for the explain statement for EXPALIN_1 step.

It seems like the runtime filter enbaled.

 

Estimated Per-Host Requirements: Memory=2.03GB VCores=2
WARNING: The following tables are missing relevant table and/or column statistics.
adb.test_1, adb.test_prt

04:EXCHANGE [UNPARTITIONED]
|
02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]
| hash predicates: t.day = day
| runtime filters: RF000 <- day
|
|--03:EXCHANGE [BROADCAST]
| |
| 01:SCAN HDFS [adb.test_1]
| partitions=1/1 files=1 size=9B
|
00:SCAN HDFS [adb.test_prt t]
partitions=3/3 files=3 size=12B
runtime filters: RF000 -> t.day

 

 

Thanks!

avatar
Expert Contributor
The profile has the information. Look for something like:

Filter 0 (1.00 MB):
- Files processed: 76 (76)
- Files rejected: 76 (76)
- Files total: 76 (76)
- RowGroups processed: 0 (0)
- RowGroups rejected: 0 (0)
- RowGroups total: 0 (0)
- Rows processed: 0 (0)
- Rows rejected: 0 (0)
- Rows total: 0 (0)
- Splits processed: 0 (0)
- Splits rejected: 0 (0)
- Splits total: 0 (0)

We don't record per-partition filtering information - but instead we
eliminate entire files, so that's captured in the 'Files' statistics below.

The 'total' is all files considered for scanning. The 'processed' value is
the total number that were submitted to that filter - if the filter arrives
late, some files could be read but not considered for filtering. The
'rejected' value is the number of files that were eliminated by the filter.
High 'rejected' numbers means an effective filter!

HTH,
Henry

avatar
Explorer

Many thanks Henry!

Last question - When using joins, does runtime filter works on INNER JOIN solely?
When I checked the profile output after performing LEFT JOIN I didn't see any mention for runtime filter.

 

Thanks

avatar
Expert Contributor
Filters - as implemented - don't work in the presence of LEFT OUTER JOIN
because the filter values are passed from right-to-left. The right side of
the join computes the set of filter values, and then the left side is
compared against those filter values. But in a LEFT JOIN all left-side
values should be output by the join, so no filtering can happen.

If you try the same query but with a RIGHT OUTER JOIN, you should see the
filters kick back in.

Henry

avatar
Rising Star

Hi Henry,

 

I have a question for you and it is about `partition pruning` ( about pruning )

 

Let's say there are two tables A and B.

And, each table is partitioned by yearweek.

And, here is the query I'd like to run. ( Yes. I need to use left join to get result what I want )

SELECT * FROM A
LEFT OUTER JOIN B
ON A.account_id = B.account_id
AND A.yearweek = 201710 and B.yearweek = 201710

Even this doesn't select specific partition in `table A` 

SELECT * FROM A
LEFT OUTER JOIN B
ON A.account_id = B.account_id
AND B.yearweek = 201710
WHERE A.yearweek = 201710

 

Like you said, `A.yearweek` = 201710 on `On clause` couldn't select partition yearweek=201710.

This might be filter is applied from `Left to Right`.

In order to select specific partition for `table A`. I used `dynamic partition` and updated query like this.

SELECT * FROM (SELECT * FROM A WHERE yearweek = 201710) a
LEFT OUTER JOIN B b
ON a.account_id = b.account_id
AND b.yearweek = 201710

Do you think this is best I can do?

Or is there way to limit data for `table A` by using `On clause`?

 

And, is there any refrerence you would recommend for me to upderstand how JOIN works in Impala and Hive?

 

Thank you very much in advance.

 

Gatsby

avatar

Your second query variant should select the partition in table A. If that's not the case, something seems wrong. Could you provide a profile that shows the lack of partitioning pruning with the second query in your use case?

 

The the first and second query variant have a different meaning, and hence different optimizations apply. The ON and WHERE clauses have very specific meanings in SQL, in particular with outer joins. The ON clause affects which rows are considered a "match" for he purpose of the outer join, so if you put "A.yearweek = 201710" in the ON clause, then those rows not satisfying that condition are considered a join non-match. The meaning of a LEFT OUTER JOIN is that the left side rows are returned even for join non-matches (with NULLs on the right side).

 

The WHERE clause is logically applied *after* the FROM clause, so all rows produced by the FROM clause are filtered (including non-matches of the outer join, so we can move a WHERE-clause predicate on A into the scan in your second query variant).

 

 

avatar
Rising Star

Alex,

 

First of all, thank you very much for your explanation.

 

You're right. the second query selects partition in table A.

And, I'm fully aware of the difference between the first one( using on clause ) and second on ( where clause ) like the way you explained.

 

The reason different variances were tried to find out ways to limit table A data before joinging two tables.

( yes, second query doesn't work this way )

In MySQL, table A could be limited by `ON clause`, but with Impala, I don't know how to do it.

 

Do you think using subquery is the best way?

 

Thank you

Gatsby

avatar

Hi Gatsby,

 

if your goal is to limit the rows of table A then I think the subquery is the safest bet. Your second variant with the WHERE clause also seems fine, but depending on what's in the FROM clause that predicate may not always be applied at the scan of A. So going with a subquery seems the most straightforward.

 

Runtime filters from right-to-left on a LEFT OUTER JOIN are not possible because restrictoins on the right side cannot be directly applied to the left side.

 

I would be surprised if MySQL had a different behavior with respect to the ON clause. Do you have an example of MySQL results being different from Impala's?

 

Alex