Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

very slow when use "like" operation in join clause

avatar
Explorer

cdh 6.1

impala 3.1

kudu 1.8

1 masternode + 5 datanode

all nodes 8 core 64g, ssd

 

create table test_like(id bigint, full_id string, primary key(id)) stored as kudu

-- insert about 1k records, what ever they are
......
select count(*) from test_like t1 left join test_like t2 on t1.full_id like t2.full_id

 

It takes 13 seconds to execute:

 

[data-60:21000] gslq4dev_iquantity>   select count(*) from test_like t1
  left join test_like t2 on t1.full_id like t2.full_id
  ;
Query: select count(*) from test_like t1
  left join test_like t2 on t1.full_id like t2.full_id
Query submitted at: 2019-02-21 13:44:55 (Coordinator: http://data-60:25000)
Query progress can be monitored at: http://data-60:25000/query_plan?query_id=594efea080891f13:9e932ad900000000
+----------+
| count(*) |
+----------+
| 1023     |
+----------+
Fetched 1 row(s) in 13.54s
[data-60:21000] gslq4dev_iquantity> summary;
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+
| Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                          |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+
| 06:AGGREGATE        | 1      | 0ns      | 0ns      | 1     | 1          | 16.00 KB  | 10.00 MB      | FINALIZE                        |
| 05:EXCHANGE         | 1      | 0ns      | 0ns      | 1     | 1          | 16.00 KB  | 16.00 KB      | UNPARTITIONED                   |
| 03:AGGREGATE        | 1      | 0ns      | 0ns      | 1     | 1          | 267.00 KB | 10.00 MB      |                                 |
| 02:NESTED LOOP JOIN | 1      | 13.42s   | 13.42s   | 1.02K | -1         | 279.00 KB | 2.00 GB       | LEFT OUTER JOIN, BROADCAST      |
| |--04:EXCHANGE      | 1      | 0ns      | 0ns      | 1.02K | -1         | 136.00 KB | 56.99 KB      | BROADCAST                       |
| |  01:SCAN KUDU     | 1      | 4.00ms   | 4.00ms   | 1.02K | -1         | 127.00 KB | 384.00 KB     | gslq4dev_iquantity.test_like t2 |
| 00:SCAN KUDU        | 1      | 0ns      | 0ns      | 1.02K | -1         | 127.00 KB | 384.00 KB     | gslq4dev_iquantity.test_like t1 |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------+

all the time spent on nested loop join

 

This is a demo sql , in the real scene I want to get some rows and all their children using the full_id field

full_id is in form of "1.2.3.4", and i have to use "like" operation to get the children( multi level)

The record count in real table is far more than 1k, and it never finish

 

why is it so slow?  how to improve it?

 

 

 

2 REPLIES 2

avatar
Cloudera Employee

For non-equi joins, we use only NESTED LOOP JOIN. But for your use case, if you are only concerned with equivalence of the first 7 letters in the string 'full_id'. Then you can convert it into an equality predicate by doing something like : strleft(t1.full_id, 7) = strleft(t2.full_id, 7)

 

So the plan for the new query would result in a hash join, which can be faster for your case.

select count(*) from test_like t1
left join test_like t2 on strleft(t1.full_id, 7) = strleft(t2.full_id, 7)

 

 

+------------------------------------------------------------------------------------+                                                                           |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 06:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(*)                                                          |
| |  row-size=8B cardinality=1                                                       |
| |                                                                                  |
| 05:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 03:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |  row-size=8B cardinality=1                                                       |
| |                                                                                  |
| 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                                          |
| |  hash predicates: strleft(t1.full_id, 7) = strleft(t2.full_id, 7)                |
| |  row-size=30B cardinality=unavailable                                            |
| |                                                                                  |
| |--04:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  01:SCAN KUDU [bik.test_like t2]                                                 |
| |     row-size=15B cardinality=unavailable                                         |
| |                                                                                  |
| 00:SCAN KUDU [bik.test_like t1]                                                    |
|    row-size=15B cardinality=unavailable                                            |
+------------------------------------------------------------------------------------+

 

avatar
Explorer

I know it will be nested loop join

But the performance is too low, it look like the operator 'like' has bug in this situation

If i use 'locate' function instead of 'like', it's much more faster

 

---LIKE---

select count(*) from test_like t1
left join test_like t2 on t1.full_id like t2.full_id
Operator              #Hosts  Avg Time  Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                     
--------------------------------------------------------------------------------------------------------------------------
06:AGGREGATE               1   0.000ns   0.000ns      1           1   16.00 KB       10.00 MB  FINALIZE                   
05:EXCHANGE                1   0.000ns   0.000ns      1           1   16.00 KB       16.00 KB  UNPARTITIONED              
03:AGGREGATE               1  16.000ms  16.000ms      1           1  135.00 KB       10.00 MB                             
02:NESTED LOOP JOIN        1  13s708ms  13s708ms  1.00M          -1  147.00 KB        2.00 GB  LEFT OUTER JOIN, BROADCAST 
|--04:EXCHANGE             1   0.000ns   0.000ns  1.00K          -1   72.00 KB       56.99 KB  BROADCAST                  
|  01:SCAN KUDU            1   4.000ms   4.000ms  1.00K          -1   59.00 KB      384.00 KB  default.test_like t2       
00:SCAN KUDU               1  36.001ms  36.001ms  1.00K          -1   59.00 KB      384.00 KB  default.test_like t1

---LOCATE---

select count(*) from test_like t1
left join test_like t2 on locate(t1.full_id, t2.full_id) > 0
Operator              #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                     
----------------------------------------------------------------------------------------------------------------------------
06:AGGREGATE               1    0.000ns    0.000ns      1           1   16.00 KB       10.00 MB  FINALIZE                   
05:EXCHANGE                1    0.000ns    0.000ns      1           1   16.00 KB       16.00 KB  UNPARTITIONED              
03:AGGREGATE               1   68.001ms   68.001ms      1           1  135.00 KB       10.00 MB                             
02:NESTED LOOP JOIN        1  460.009ms  460.009ms  1.00M          -1  147.00 KB        2.00 GB  LEFT OUTER JOIN, BROADCAST 
|--04:EXCHANGE             1    0.000ns    0.000ns  1.00K          -1   72.00 KB       56.99 KB  BROADCAST                  
|  01:SCAN KUDU            1    4.000ms    4.000ms  1.00K          -1   59.00 KB      384.00 KB  default.test_like t2       
00:SCAN KUDU               1    0.000ns    0.000ns  1.00K          -1   59.00 KB      384.00 KB  default.test_like t1