We have an external Hive table (let's call it example_table) created on top of an ORC file in Hadoop. A simple select query such as:
select * from example_table
works fine, but
select * from example_table order by id
returns many extra rows that look like corrupted data. We have seen cases where this returns 10x as many rows as the same query without the order by clause.
If it's relevant, the ORC files were created by Spark, and we are using Hive 1.2.1 on HDP 2.5. I suspect this may have something to do with the fact that the ORC data has fields of string datatype that contain newlines. Is this a known bug, and/or are there any Hive settings we could try changing to fix it?
UPDATE: Here is a Hive script we came up with to reproduce the problem:
create table default.test_errors(c1 string)
stored as orc;
--Make sure to include newlines here
with CTE as
(select
'a
b
c'
as c1)
insert into default.test_errors
select c1 from CTE;
--Returns 1 row
select * from default.test_errors;
--Returns 3 rows
select * from default.test_errors
order by c1;
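
One thing we plan to check (this is our guess, not a confirmed diagnosis): the plain select * may be answered by Hive's fetch task, which reads the ORC file directly, while the order by query launches a job that re-reads the data through a different path. Disabling fetch-task conversion should show whether the extra rows come from that execution path rather than from order by itself:
--Force even the plain select through the job execution path
--(hive.fetch.task.conversion is a standard Hive setting; whether it
--reproduces the issue here is an assumption on our part)
set hive.fetch.task.conversion=none;
--If this now also returns 3 rows, the problem is in the job-side
--read path, not in the order by clause itself
select * from default.test_errors;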