Created 09-20-2020 10:26 PM
Hello,
Observing "union all" is dropping records with all null/empty fields.
CREATE tABLE IF NOT EXISTS
as_adventure.t1
(
productsubcategorykey INT,
productline STRING);
insert into t1 values (1,'l1');
insert into t1 values (2,'l1');
insert into t1 values (1,'l2');
insert into t1 values (3,'l3');
insert into t1 values (null,'');
select * from t1;
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c3s0 c3s0,
t_53.c4 c4,
t_53.c5s0 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productline c3s0,
t.productsubcategorykey c4,
t.productsubcategorykey c5s0
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3,
4,
5 ) t_53
GROUP BY
4,
5,
6,
7
UNION ALL
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c3s0 c3s0,
t_53.c4 c4,
t_53.c5s0 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productline c3s0,
t.productsubcategorykey c4,
t.productsubcategorykey c5s0
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3,
4,
5 ) t_53
GROUP BY
4,
5,
6,
7
Above query only shows 8 records instead of 10 records.
This is observed for the following versions:
Created 09-21-2020 05:17 AM
Hello @aryan_dear ,
thank you for reporting your observations on Impala UNION ALL returning wrong results up to CDH6.
The issue you are facing with is most likely:
https://issues.apache.org/jira/browse/IMPALA-7957
The fix for this issue is implemented from Impala3.3., for example in CDP.
Please let us know if you need more information on this topic!
Kind regards:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Created on 09-21-2020 09:57 AM - edited 09-21-2020 10:03 AM
This is definitely a bug. Thanks for the clear report and reproduction. It's not IMPALA-7957 but is somewhat related. This is new to us so I filed https://issues.apache.org/jira/browse/IMPALA-10182 to track it.
It looks like it can only happen when you have a UNION ALL, plus subqueries where the same column appears twice in the select list, plus NULL values in those columns.
You can work around the issue by removing the duplicated entries in the subquery select list. E.g. the following query is equivalent and returns the expected results.
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c4 c4,
t_53.c4 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c4
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3 ) t_53
GROUP BY
4,
5,
6,
7
UNION ALL
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c5s0 c4,
t_53.c5s0 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c5s0
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3) t_53
GROUP BY
4,
5,
6,
7;