Reply
New Contributor
Posts: 2
Registered: ‎02-10-2017

Impala Queries slow using GROUP BY and LIKE

Hi,

 

We are testing Apache Impala and have noticed that using GROUP BY and LIKE together works very slowly -- separate queries work much faster. Here are two examples:

 

SELECT * FROM hive.default.pcopy1B where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') limit 100;


1.37s 1.08s 1.35s

select "by", type, ranking, count(*) from pcopy where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') group by "by", type, ranking order by 4 desc limit 10;
156.64s 155.63s

 

Could someone please explain why this issue occurs, and if there are any workarounds?

 

Thanks,

 

David

Cloudera Employee
Posts: 307
Registered: ‎10-16-2013

Re: Impala Queries slow using GROUP BY and LIKE

Impala provides query profiles to help understand/debug the performance of queries.

Take a look at the queries page on http://your-impalad-host:25000.

 

In your specific example, the first query can terminate much earlier because the scan can stop once 100 rows have been found. In your second query, all rows must be scanned and fed into the aggregation, and then all those rows must be fed into a top-n node. In short, the second query is inherently much more expensive based on what the query is asking for.