Explorer
Posts: 11
Registered: 08-13-2014

Predicate push down bug

Not sure if this is a known issue or has already been resolved, but here we go. We are currently running Impala 2.2.0-cdh5.4.5.

 

To repro, create these tables:


use debug;
create table orders (order_id int, address_id int);
insert into table orders values (1, 10), (2, 20), (3, 30);

create table addresses (address_id int, state string);
insert into table addresses values (10, "CA"), (20, "NY"), (30, "CA");

 

Then run the query:

 

SELECT *
FROM (
    SELECT
      shipping_addr.state,
      count(order_id) AS num_orders
    FROM
    (
        SELECT
          order_id,
          address_id
        FROM orders
    ) order_data
    LEFT OUTER JOIN [shuffle]
    (
        SELECT
          address_id,
          state
        FROM addresses
    ) shipping_addr
    ON order_data.address_id = shipping_addr.address_id
    GROUP BY
      shipping_addr.state
) unf
WHERE
   lower(state) = lower('NY')

 

When I run this I get:

 

+-------+------------+
| state | num_orders |
+-------+------------+
| NULL  | 2          |
| NY    | 1          |
+-------+------------+

 

I would expect to see only NY 1, because the other two orders are in CA. What seems to happen is that the filter on state gets pushed all the way down into the subquery that selects from addresses. Because that's a left join, the orders that should have been filtered out stay in the result, just without their state information.
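To make the difference concrete, here is a sketch (not the exact plan, just equivalent SQL) of the two evaluation orders. The first is the correct semantics; the second is what the plan above effectively computes once the predicate is pushed into the addresses scan:

```sql
-- Correct semantics: filter AFTER the left join and aggregation.
-- The CA orders join successfully, receive state 'CA', and are then
-- dropped by the outer WHERE, leaving only the (NY, 1) row.
SELECT state, num_orders
FROM (
  SELECT a.state, count(o.order_id) AS num_orders
  FROM orders o
  LEFT OUTER JOIN addresses a ON o.address_id = a.address_id
  GROUP BY a.state
) unf
WHERE lower(state) = lower('NY');

-- What the buggy plan effectively computes: the predicate is evaluated
-- inside the scan of addresses, so the CA address rows never reach the
-- join. The left join then keeps the CA orders anyway, with a NULL
-- state, which produces the spurious (NULL, 2) group.
SELECT a.state, count(o.order_id) AS num_orders
FROM orders o
LEFT OUTER JOIN (
  SELECT address_id, state
  FROM addresses
  WHERE lower(state) = lower('NY')   -- predicate applied too early
) a ON o.address_id = a.address_id
GROUP BY a.state;
```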

 

+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| debug.addresses, debug.orders                                                      |
|                                                                                    |
| 08:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 07:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(order_id)                                                   |
| |  group by: shipping_addr.state                                                   |
| |                                                                                  |
| 06:EXCHANGE [HASH(shipping_addr.state)]                                            |
| |                                                                                  |
| 03:AGGREGATE                                                                       |
| |  output: count(order_id)                                                         |
| |  group by: state                                                                 |
| |                                                                                  |
| 02:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]                                        |
| |  hash predicates: address_id = address_id                                        |
| |                                                                                  |
| |--05:EXCHANGE [HASH(address_id)]                                                  |
| |  |                                                                               |
| |  01:SCAN HDFS [debug.addresses]                                                  |
| |     partitions=1/1 files=1 size=18B                                              |
| |     predicates: lower(debug.addresses.state) = lower('NY')                       |
| |                                                                                  |
| 04:EXCHANGE [HASH(address_id)]                                                     |
| |                                                                                  |
| 00:SCAN HDFS [debug.orders]                                                        |
|    partitions=1/1 files=1 size=15B                                                 |
+------------------------------------------------------------------------------------+

You can see the predicate in the 01:SCAN HDFS [debug.addresses] node above.

 

We've already worked out a workaround, but since we have such a clean repro I thought I'd share it so you can file a bug, unless I'm missing something and this is expected behaviour.

 

-Sean

 

 

Explorer
Posts: 11
Registered: 08-13-2014

Re: Predicate push down bug

Pinging this to bring it back up. This seems like a pretty clear bug; just wondering whether it's been resolved or is at least on your radar :)

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Predicate push down bug

Thanks a lot for posting this issue and the recent reminder! It is indeed a clear bug.

 

I've filed https://issues.cloudera.org/browse/IMPALA-3167 to track progress on this issue.

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Predicate push down bug

In the meantime, you should be able to work around the problem by using an INNER JOIN instead of a LEFT OUTER JOIN.
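Applied to the repro above, that workaround might look like the following sketch (the [shuffle] hint and inner subqueries are dropped for brevity). The rewrite is safe here because the outer predicate lower(state) = lower('NY') can never be true for the NULL states that the LEFT OUTER JOIN would have produced, so the inner join returns the same rows:

```sql
-- INNER JOIN keeps only orders with a matching address, which is
-- equivalent here because the outer WHERE rejects NULL states anyway.
SELECT *
FROM (
  SELECT shipping_addr.state,
         count(order_id) AS num_orders
  FROM orders order_data
  INNER JOIN addresses shipping_addr
    ON order_data.address_id = shipping_addr.address_id
  GROUP BY shipping_addr.state
) unf
WHERE lower(state) = lower('NY');
```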

New Contributor
Posts: 4
Registered: 08-15-2016

Re: Predicate push down bug

I hit this bug with (I think) 100% reproducibility.

My query has 14 JOINs:

 

 

FROM id_and_week

LEFT OUTER JOIN xVC using (id_sddc, week)
LEFT OUTER JOIN xNSX using (id_sddc, week)

..

..

..

LEFT OUTER JOIN xVROPS using (id_sddc, week)
LEFT OUTER JOIN xVSAN using (id_sddc, week)
WHERE not  ....

 

 

The workaround query would be ... I'm struggling to find the right words; you get the idea.

 

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Predicate push down bug

Thanks for your input and please accept my apologies for this issue.

 

Would you mind pasting your full query so I can verify whether it is indeed the same or maybe a different issue?

 

New Contributor
Posts: 4
Registered: 08-15-2016

Re: Predicate push down bug


 

Hi,

Here is the query. Assume that there are 14 xXXX* sub-queries (the version of the query below has only 5 out of those 14: xNSX, xVC, xVRA, xVROPS, xVSAN).

The number of rows in each x* ranges from several hundred to several hundred thousand.

Assume id_and_week has millions of rows.

 

-----------------------------------------------------------------------------
--
-- VC<product> has the following columns 
-- id_sddc - lowercase vc id as in-------
-- <product>id - as crd_product_instance.id
-- <product>name - as crd_product_instance.name
-- <product>version - as crd_product_instance.version
-- <product>edition - as crd_product_instance.edition
--
-----------------------------------------------------------------------------

--
--
--


WITH
--
--
xNSX as (
with NSX as
(
select
to_date(trunc(pa__arrival_ts, 'D')) as week,
id as nsxid, name as nsxname, version as nsxversion, edition as nsxedition, pa__bundle__fk as joinid
FROM history.crd_product_instance
where pa__collector_id like 'asdfasdf%'
and name != 'xxx'
)
select DISTINCT
lower(id) as id_sddc, 
to_date(trunc(pa__arrival_ts, 'D')) as week, 
nsxid, nsxname, nsxversion, nsxedition 
FROM history.crd_product_instance 
inner join NSX on nsx.joinid = pa__bundle__fk
where name = 'yyyyy'
),


--
--
--
xVC as (
select DISTINCT
lower(id) as id_sddc, 
to_date(trunc(pa__arrival_ts, 'D')) as week,
id as vcid, name as vcname, version as vcversion, edition as vcedition
FROM history.crd_product_instance
where name ='zzzzz'
)
,

--
--

xVRA as (
with VRA as
(
select
id as vraid, name as vraname, version as vraversion, edition as vraedition, pa__collector_instance_id as joinid
FROM history.crd_product_instance
where pa__collector_id like 'vcac.%'
and name != 'aaaaa'
)
select DISTINCT
lower(id) as id_sddc, 
to_date(trunc(pa__arrival_ts, 'D')) as week,
vraid, vraname, vraversion, vraedition 
FROM history.crd_product_instance 
inner join VRA on vra.joinid = pa__collector_instance_id
where name = 'bbbbb'
),

--
--
xVROPS as (
with VROPS as
(
select
to_date(trunc(pa__arrival_ts, 'D')) as week,
id as vropsid, name as vropsname, version as vropsversion, edition as vropsedition, pa__bundle__fk as joinid
FROM history.crd_product_instance
where pa__collector_id like 'xxxxx%'
and name != '1111111'
)
select DISTINCT
lower(identifier) as id_sddc, 
to_date(trunc(pa__arrival_ts, 'D')) as week, 
vropsid, vropsname, vropsversion, vropsedition 
FROM history.VREALIZE_OPERATIONS_MANAGER_ADAPTER_DETAILS 
inner join VROPS on vrops.joinid = pa__bundle__fk
where identifier is not null and identifier not like ''
),
--
--
xVSAN as (
with VSAN as
(
select
to_date(trunc(pa__arrival_ts, 'D')) as week,
id as vsanid, name as vsanname, version as vsanversion, edition as vsanedition, lower(pa__collector_instance_id) as joinid
FROM history.crd_product_instance
where pa__collector_id like 'zzzzzzzzz.%'
)
select DISTINCT
lower(id) as id_sddc, 
to_date(trunc(pa__arrival_ts, 'D')) as week, 
vsanid, vsanname, vsanversion, vsanedition
FROM history.crd_product_instance
join VSAN on vsan.joinid = lower(id)
where pa__collector_id not like 'zzzzzz.%'
)
,
id_and_week as (
SELECT DISTINCT id_sddc, week
FROM (
select * from xVC
union all select * from xNSX
union all select * from xVRA
union all select * from xVROPS
union all select * from xVSAN
) X
-- where id_sddc is not null and week is not null -- commented or not - this line does not affect the result
)
,
finale as (
SELECT distinct 
id_and_week.id_sddc,
id_and_week.week
--
,xVC.vcid, xVC.vcname, xVC.vcversion, xVC.vcedition
,xNSX.nsxid, xNSX.nsxname, xNSX.nsxversion, xNSX.nsxedition
,xVRA.vraid, xVRA.vraname, xVRA.vraversion, xVRA.vraedition
,xVROPS.vropsid, xVROPS.vropsname, xVROPS.vropsversion, xVROPS.vropsedition
,xVSAN.vsanid, xVSAN.vsanname, xVSAN.vsanversion, xVSAN.vsanedition
FROM id_and_week

LEFT OUTER JOIN xVC using (id_sddc, week)
LEFT OUTER JOIN xNSX using (id_sddc, week)
LEFT OUTER JOIN xVRA using (id_sddc, week)
LEFT OUTER JOIN xVROPS using (id_sddc, week)
LEFT OUTER JOIN xVSAN using (id_sddc, week)
WHERE not (
xNSX.NSXid is null and
xVRA.vraid is null and
xVROPS.vropsid is null and
xVSAN.vsanid is null
)
)


select * from finale
where id_sddc like '2016-%'
order by id_sddc, week
 

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Predicate push down bug

Hi PyMeh,

 

Thanks for posting your query. I was able to verify that it is indeed hitting IMPALA-3167.

 

The WHERE-clause predicate "id_sddc like '2016-%'" is assigned to the scans, but should also be assigned to the LEFT OUTER JOINs, so you may get incorrect results.

 

Is the workaround of using INNER JOIN instead of LEFT OUTER JOIN acceptable to you?

 

Alex

New Contributor
Posts: 4
Registered: 08-15-2016

Re: Predicate push down bug

Hi Alex,
The query is used in Tableau, and Tableau requires a schema where the columns from all 14 x* tables (xVC, xNSX, etc.) are "glued" together:

xABC
+---------+---+------+------+
| id_sddc | A | B    | C    |
+---------+---+------+------+
| 2       | 2 | 2222 | 2222 |
| 3       | 3 | 3333 | 3333 |
+---------+---+------+------+

xDEF
+---------+---+------+------+
| id_sddc | D | E    | F    |
+---------+---+------+------+
| 2       | 2 | 22DD | 22DD |
| 3       | 3 | 3333 | 3333 |
+---------+---+------+------+

Expected final result
+---------+---+------+------+---+------+------+----
| id_sddc | A | B    | C    | D | E    | F    | .... many more columns to the right ...
+---------+---+------+------+---+------+------+----
| 2       | 2 | 2222 | 2222 | 3 | 2eee | 2fDD | ....
| 3       | 3 | 3333 | 3333 | 3 | 3eee | 3dfd | ....
+---------+---+------+------+---+------+------+----



Although in theory it might be possible to implement my case with INNER JOINs, I don't even want to try, mostly because of the business assumptions I'd have to bake into my queries; assumptions that are beyond my control.

-Rumen

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Predicate push down bug

Thanks for clarifying where this query came from. That's an important detail for us. It seems the workaround is not really practical in your use case.
