We have an external Hive table (let's call it example_table) created on top of an ORC file in Hadoop. A simple select query such as:
select * from example_table
works fine, but
select * from example_table order by id
returns many extra rows that look like corrupted data. We have seen cases where this returns 10x as many rows as the same query without the order by clause.
If it's relevant, the ORC files were created by Spark, and we are using Hive 1.2.1 on HDP 2.5. I suspect this may have something to do with the fact that the ORC data has fields of string datatype that contain newlines. Is this a known bug, and/or are there any Hive settings we could try changing to fix it?
UPDATE: Here is a Hive script we came up with to reproduce the problem:
create table default.test_errors(c1 string)
stored as orc;
--Make sure to include newlines here
with CTE as
(select
'a
b
c'
as c1)
insert into default.test_errors
select c1 from CTE;
--Returns 1 row
select * from default.test_errors;
--Returns 3 rows
select * from default.test_errors
order by c1;
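
One thing we plan to check (this is our guess, not a confirmed diagnosis): the plain select * may be answered by Hive's fetch task, which reads the ORC file directly, while the order by query launches a job that re-reads the data through a different path. Disabling fetch-task conversion should show whether the extra rows come from that execution path rather than from order by itself:
--Force even the plain select through the job execution path
--(hive.fetch.task.conversion is a standard Hive setting; whether it
--reproduces the issue here is an assumption on our part)
set hive.fetch.task.conversion=none;
--If this now also returns 3 rows, the problem is in the job-side
--read path, not in the order by clause itself
select * from default.test_errors;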