Support Questions
Find answers, ask questions, and share your expertise

Impala sub query SUM

Impala sub query SUM

New Contributor

Hi Team,
I have a below query which retruns top three entities based on sum of weighting.
select entity, sum(weighting) weighting from entity_weighting where weighting is not null and CONCAT(year, month) in ('201310', '201311') group by entity order by weighting desc, entity limit 3
Below is the ResultSet:-
+------------+-----------+
| entity     | weighting |
+------------+-----------+
| Entity9999 | 275993268 |
| Entity875  | 270759260 |
| Entity742  | 105082776 |
+------------+-----------+

 

But If I Select entity from the above query as a sub query it returns different values. Below is the query.
select entity from (select entity, sum(weighting) weighting from entity_weighting where weighting is not null and CONCAT(year, month) in ('201310', '201311') group by entity order by weighting desc, entity limit 3) test
Below is the ResultSet:-
+-----------+
| entity    |
+-----------+
| Entity1   |
| Entity10  |
| Entity100 |
+-----------+

 

We have created a table using the above mentioned sub query and when we join the tables, it is showing correct resultset.

 

Is there any issue in Impala with derived tables in from clause or is this any other known issue.

 

Please also suggest if there is any workaround. Also creation of a table for subquery is not a practical solution in our case.


Thank in Advance,
Rohith.

2 REPLIES 2

Re: Impala sub query SUM

Cloudera Employee
Do you get correct / expected results if you include the weighting column also in the result set, i.e.

select entity, weighting from (select entity, sum(weighting) weighting from entity_weighting...

If so, perhaps a workaround could be to wrap the query in one more level of subquery and only pull out the 'entity' value at the highest level:

select entity from (select entity, weighting from (select entity, sum(weighting) weighting from entity_weighting...

The observed behavior sounds like a bug, but that's better handled by filing a JIRA at issues.cloudera.org to see if it's a known issue or has already been fixed in a recent or upcoming release.

Re: Impala sub query SUM

Master Collaborator

Hi Rohith,

 

I'm afraid you've ran into a known Planner problem https://issues.cloudera.org/browse/IMPALA-822

Although the description in that JIRA may seem unrelated to your issue, I assure you it is the same underlying problem.

A fix for this issue is currently in review and may make it into the upcoming Impala 1.2.4 release.

 

In short, the problem happens because the outer query block does not reference 'weighting' in the select list. Impala tries to optimize away unreferenced columns from views, but there is a bug in that part of the code (it incorrectly optimizes away weighting in this case, that's why the order is wrong). If you you changed your query to the following it should return correct results:

 

select entity, weighting from (select entity, sum(weighting) weighting from entity_weighting where weighting is not null and CONCAT(year, month) in ('201310', '201311') group by entity order by weighting desc, entity limit 3) test

 

You can work around the problem using the following (admittedly ugly) trick:

select entity from

  (select entity, sum(weighting) weighting

   from entity_weighting where weighting is not null and CONCAT(year, month) in ('201310', '201311')

   group by entity order by weighting desc, entity limit 3

   union all values(null, null)) test

 

The union forces all underlying columns to be 'referenced', fixing the problem. You can adjust then query to filter out the dummy NULL.

 

Hope it helps!

 

Alex