Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

"union all" dropping records with all null/empty fields

avatar
New Contributor

Hello,

 

Observing "union all" is dropping records with all null/empty fields. 

 

CREATE tABLE IF NOT EXISTS
    as_adventure.t1 
    ( 
        productsubcategorykey INT, 
        productline STRING);

insert into t1 values (1,'l1');
insert into t1 values (2,'l1');
insert into t1 values (1,'l2');
insert into t1 values (3,'l3');
insert into t1 values (null,'');

select * from t1; 

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7
 
UNION ALL

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7

 

  

Above query only shows 8 records instead of 10 records. 

 

This is observed for the following versions:

  1. 5.7.6
  2. 5.8.5
  3. 5.9.3
  4. 5.10.2
  5. 5.11.2
  6. 5.12.2
  7. 5.13.1
  8. 5.14.0
  9. 5.14.4
  10. 5.15.1
  11. 5.16.2
  12. 6.0.1
  13. 6.1.1
  14. 6.2.0
  15. 6.3.1
2 REPLIES 2

avatar
Moderator

Hello @aryan_dear ,

 

thank you for reporting your observations on Impala UNION ALL returning wrong results up to CDH6.

 

The issue you are facing with is most likely:

https://issues.apache.org/jira/browse/IMPALA-7957

 

The fix for this issue is implemented from Impala3.3., for example in CDP.

 

Please let us know if you need more information on this topic!

 

Kind regards:

Ferenc


Ferenc Erdelyi, Technical Solutions Manager

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Learn more about the Cloudera Community:

avatar

This is definitely a bug. Thanks for the clear report and reproduction. It's not IMPALA-7957 but is somewhat related. This is new to us so I filed https://issues.apache.org/jira/browse/IMPALA-10182 to track it.

 

It looks like it can only happen when you have a UNION ALL, plus subqueries where the same column appears twice in the select list, plus NULL values in those  columns.

 

You can work around the issue by removing the duplicated entries in the subquery select list. E.g. the following query is equivalent and returns the expected results.

 

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c2            c3s0,
    t_53.c4              c4,
    t_53.c4            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productsubcategorykey c4
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3 ) t_53
GROUP BY
    4,
    5,
    6,
    7

UNION ALL
SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c2            c3s0,
    t_53.c5s0              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3) t_53
GROUP BY
    4,
    5,
    6,
    7;