Created on 02-20-2019 10:02 PM - edited 09-16-2022 07:10 AM
cdh 6.1
impala 3.1
kudu 1.8
1 masternode + 5 datanode
all nodes 8 core 64g, ssd
create table test_like(id bigint, full_id string, primary key(id)) stored as kudu
-- insert about 1k records, what ever they are
......
select count(*) from test_like t1 left join test_like t2 on t1.full_id like t2.full_id
It takes 13 seconds to execute:
[data-60:21000] gslq4dev_iquantity> select count(*) from test_like t1 left join test_like t2 on t1.full_id like t2.full_id ; Query: select count(*) from test_like t1 left join test_like t2 on t1.full_id like t2.full_id Query submitted at: 2019-02-21 13:44:55 (Coordinator: http://data-60:25000) Query progress can be monitored at: http://data-60:25000/query_plan?query_id=594efea080891f13:9e932ad900000000 +----------+ | count(*) | +----------+ | 1023 | +----------+ Fetched 1 row(s) in 13.54s [data-60:21000] gslq4dev_iquantity> summary; +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+ | 06:AGGREGATE | 1 | 0ns | 0ns | 1 | 1 | 16.00 KB | 10.00 MB | FINALIZE | | 05:EXCHANGE | 1 | 0ns | 0ns | 1 | 1 | 16.00 KB | 16.00 KB | UNPARTITIONED | | 03:AGGREGATE | 1 | 0ns | 0ns | 1 | 1 | 267.00 KB | 10.00 MB | | | 02:NESTED LOOP JOIN | 1 | 13.42s | 13.42s | 1.02K | -1 | 279.00 KB | 2.00 GB | LEFT OUTER JOIN, BROADCAST | | |--04:EXCHANGE | 1 | 0ns | 0ns | 1.02K | -1 | 136.00 KB | 56.99 KB | BROADCAST | | | 01:SCAN KUDU | 1 | 4.00ms | 4.00ms | 1.02K | -1 | 127.00 KB | 384.00 KB | gslq4dev_iquantity.test_like t2 | | 00:SCAN KUDU | 1 | 0ns | 0ns | 1.02K | -1 | 127.00 KB | 384.00 KB | gslq4dev_iquantity.test_like t1 | +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+
all the time spent on nested loop join
This is a demo sql , in the real scene I want to get some rows and all their children using the full_id field
full_id is in form of "1.2.3.4", and i have to use "like" operation to get the children( multi level)
The record count in real table is far more than 1k, and it never finish
why is it so slow? how to improve it?
Created 03-05-2019 04:20 PM
For non-equi joins, we use only NESTED LOOP JOIN. But for your use case, if you are only concerned with equivalence of the first 7 letters in the string 'full_id'. Then you can convert it into an equality predicate by doing something like : strleft(t1.full_id, 7) = strleft(t2.full_id, 7)
So the plan for the new query would result in a hash join, which can be faster for your case.
select count(*) from test_like t1
left join test_like t2 on strleft(t1.full_id, 7) = strleft(t2.full_id, 7)
+------------------------------------------------------------------------------------+ | | PLAN-ROOT SINK | | | | | 06:AGGREGATE [FINALIZE] | | | output: count:merge(*) | | | row-size=8B cardinality=1 | | | | | 05:EXCHANGE [UNPARTITIONED] | | | | | 03:AGGREGATE | | | output: count(*) | | | row-size=8B cardinality=1 | | | | | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST] | | | hash predicates: strleft(t1.full_id, 7) = strleft(t2.full_id, 7) | | | row-size=30B cardinality=unavailable | | | | | |--04:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN KUDU [bik.test_like t2] | | | row-size=15B cardinality=unavailable | | | | | 00:SCAN KUDU [bik.test_like t1] | | row-size=15B cardinality=unavailable | +------------------------------------------------------------------------------------+
Created 03-06-2019 06:00 PM
I know it will be nested loop join
But the performance is too low, it look like the operator 'like' has bug in this situation
If i use 'locate' function instead of 'like', it's much more faster
---LIKE---
select count(*) from test_like t1 left join test_like t2 on t1.full_id like t2.full_id
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail -------------------------------------------------------------------------------------------------------------------------- 06:AGGREGATE 1 0.000ns 0.000ns 1 1 16.00 KB 10.00 MB FINALIZE 05:EXCHANGE 1 0.000ns 0.000ns 1 1 16.00 KB 16.00 KB UNPARTITIONED 03:AGGREGATE 1 16.000ms 16.000ms 1 1 135.00 KB 10.00 MB 02:NESTED LOOP JOIN 1 13s708ms 13s708ms 1.00M -1 147.00 KB 2.00 GB LEFT OUTER JOIN, BROADCAST |--04:EXCHANGE 1 0.000ns 0.000ns 1.00K -1 72.00 KB 56.99 KB BROADCAST | 01:SCAN KUDU 1 4.000ms 4.000ms 1.00K -1 59.00 KB 384.00 KB default.test_like t2 00:SCAN KUDU 1 36.001ms 36.001ms 1.00K -1 59.00 KB 384.00 KB default.test_like t1
---LOCATE---
select count(*) from test_like t1 left join test_like t2 on locate(t1.full_id, t2.full_id) > 0
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ---------------------------------------------------------------------------------------------------------------------------- 06:AGGREGATE 1 0.000ns 0.000ns 1 1 16.00 KB 10.00 MB FINALIZE 05:EXCHANGE 1 0.000ns 0.000ns 1 1 16.00 KB 16.00 KB UNPARTITIONED 03:AGGREGATE 1 68.001ms 68.001ms 1 1 135.00 KB 10.00 MB 02:NESTED LOOP JOIN 1 460.009ms 460.009ms 1.00M -1 147.00 KB 2.00 GB LEFT OUTER JOIN, BROADCAST |--04:EXCHANGE 1 0.000ns 0.000ns 1.00K -1 72.00 KB 56.99 KB BROADCAST | 01:SCAN KUDU 1 4.000ms 4.000ms 1.00K -1 59.00 KB 384.00 KB default.test_like t2 00:SCAN KUDU 1 0.000ns 0.000ns 1.00K -1 59.00 KB 384.00 KB default.test_like t1