Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala Queries slow using GROUP BY and LIKE

Impala Queries slow using GROUP BY and LIKE

New Contributor

Hi,

 

We are testing Apache Impala and have noticed that using GROUP BY and LIKE together works very slowly -- separate queries work much faster. Here are two examples:

 

SELECT * FROM hive.default.pcopy1B where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') limit 100;


1.37s 1.08s 1.35s

select "by", type, ranking, count(*) from pcopy where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') group by "by", type, ranking order by 4 desc limit 10;
156.64s 155.63s

 

Could someone please explain why this issue occurs, and if there are any workarounds?

 

Thanks,

 

David

1 REPLY 1

Re: Impala Queries slow using GROUP BY and LIKE

Master Collaborator

Impala provides query profiles to help understand/debug the performance of queries.

Take a look at the queries page on http://your-impalad-host:25000.

 

In your specific example, the first query can terminate much earlier because the scan can stop once 100 rows have been found. In your second query, all rows must be scanned and fed into the aggregation, and then all those rows must be fed into a top-n node. In short, the second query is inherently much more expensive based on what the query is asking for.

Don't have an account?
Coming from Hortonworks? Activate your account here