Since the upgrade from hdp 3.1.0 to 3.1.4, I have some issue in Hive I do not understand. Note that I am only using ORC transactional tables.
For instance this (simplified) query:
with cte as (
from event e
join contact as c on c.id=e.contact_id
lateral view outer
json_tuple(cte.json, 'customfield') cv AS `customfield`
It worked perfectly before the upgrade.
Now, even if the CTE returns a certain number of rows, using the lateral view will just drop rows from
the resultset, without any error, whereas there is no extra where clause outside the CTE (in my real example, the query returns 66 rows without the lateral view, but only 19 with).
Another extremely surprising thing is that inside the CTE there is an if statement, for instance: `if(contact.is_deleted is null, 'true', 'false')`. If I replace the `is null` with `is not distinct from null`, which should be perfectly valid, no rows are returned by the CTE.
I tried quite a few variations:
extract one row from the contact table to create a second table with only one contact (`create table contact2 as select * from contact where id=42`). I get the exact same behaviour, so I ruled out corrupted data
if I replace the event table by a static CTE (`select stack(1, ...)`) I have the result I expect
if I remove the lateral view, I have the number of rows I expect (as long as I do not use is distinct from)
if instead of a CTE I create and use a temporary table, the outcome does not change.
I am completely at loss and I have no idea why this happens and how I can trust hive.
I cannot replicate the error by generating manual data so I cannot give a (not) working example.