Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Bug report: sort by and timestamp casting filters out too many rows in hive

Highlighted

Bug report: sort by and timestamp casting filters out too many rows in hive

Expert Contributor

Consider this example.

Preparation: 

create temporary table opens as (
  select stack(1, 
         1 , cast('2019-11-13 08:07:28' as timestamp)
   ) as (id , load_ts )
);

 

Queries: This is just about counting the number of rows, with filters always matching, and possibly sort by. 1 is always expected.

select count(*) from ( select * from opens) t;
select count(*) from ( select * from opens sort by id) t;

select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' ) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13 08:07:00' sort by id) t;

select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;

 The last query (sort by and <= on timestamp) returns 0 rows.

I believe that this is the cause of other issues I have, where I have missing rows in queries with the timestamp (but not the explicit sort by).

Note that if instead of a temporary table I use a CTE for opens, the issue does not appear.

I tried workarounds (inverse order of operands, adding not or not not ) to no avail.

One thing that did work is to explicitly cast the string to a timestamp:

select count(*) from ( select * from opens where load_ts <= cast('2019-11-13 09:07:00' as timestamp) sort by id) t;

It might be good practice indeed, but there still is a discrepancy between how >= and <= are handled, or how sort by works.

 

Note: this is on Hive from 3.1.4, without llap.

Don't have an account?
Coming from Hortonworks? Activate your account here