Welcome to the Cloudera Community

premiere · ‎02-10-2017

Hi,

We are testing Apache Impala and have noticed that using GROUP BY and LIKE together works very slowly -- separate queries work much faster. Here are two examples:

SELECT * FROM hive.default.pcopy1B where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') limit 100;

1.37s 1.08s 1.35s

select "by", type, ranking, count(*) from pcopy where (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') group by "by", type, ranking order by 4 desc limit 10;
156.64s 155.63s

Could someone please explain why this issue occurs, and if there are any workarounds?

Thanks,

David

Cloudera Community

Welcome to the Cloudera Community

Who agreed with this topic

Impala Queries slow using GROUP BY and LIKE