Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive returns extra rows with "order by"

Solved Go to solution
Highlighted

Hive returns extra rows with "order by"

We have an external hive table (let's call it example_table) created on top of an orc file in Hadoop. Doing a simple select query such as:

select * from example_table

works fine, but

select * from example_table order by id

returns lots of extra rows that look like corrupt data. We have seen cases where this returns 10x the number of rows as the query without the order by clause.

If its relevant, the orc files were created by Spark and we are using Hive 1.2.1 on HDP 2.5. I suspect this may have something to do with the fact that the orc data has fields of string datatype that contain newlines. Is this a known bug and/or are there any Hive settings we could try changing to fix this?

UPDATE: Here is a Hive script we came up with to reproduce the problem:

create table default.test_errors(c1 string) stored as orc;

--Make sure to include newlines here

with CTE as (select 'a

b

c' as c1)

insert into default.test_errors

select c1 from CTE;

--Returns 1 row

select * from default.test_errors;

--Returns 3 rows

select * from default.test_errors order by c1;

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Hive returns extra rows with "order by"

Expert Contributor

Based on your description, I believe you are encountering the problem that someone described here.

Your simple select works fine, but as soon as you need a mapreduce step (because of the order by in this case) the handling of newline characters fails.

Add this property to the hive-site via Ambari

hive.query.result.fileformat=SequenceFile

Now the query should work as expected since the query output is written in a binary format as opposed to a text format.


If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.

View solution in original post

10 REPLIES 10
Highlighted

Re: Hive returns extra rows with "order by"

Contributor

are you definitely doing select \* in both cases? it's possible that doing select \* has some column where the data type fails evaluation where select x, y, z order by x perhaps excludes that column?

Highlighted

Re: Hive returns extra rows with "order by"

Not 100% sure what you mean but I do think the issue is related to certain columns in the table. If I do a

"select id order by id"

that comes through without any issues. It's only when certain columns are included that we get corruption, and these seem to be the columns that are of datatype string with newlines in them.

Highlighted

Re: Hive returns extra rows with "order by"

Contributor

that's odd indeed, I'd expect the inverse behavior to be possible

Highlighted

Re: Hive returns extra rows with "order by"

Contributor

have you tried to update the data type format for the text columns that contain newline characters to handle the newline characters internally? perhaps hive is somehow interpreting the newline within this column as a new data row, which could account for the additional rows, this is a relatively rough stab though, I'd have to do some more research

Highlighted

Re: Hive returns extra rows with "order by"

AFAIK Hive should be able to handle text with newlines if the table is stored in a binary format such as ORC, Avro, Parquet etc. We could strip out the newlines from the data before putting it in Hive, but we'd rather not do that since the original source data (coming from SQL Server) contains newlines.

Highlighted

Re: Hive returns extra rows with "order by"

Contributor

I agree that's the expected behavior, merely suggesting some troubleshooting to dig into exactly why the issue is happening

Highlighted

Re: Hive returns extra rows with "order by"

Expert Contributor

When you select all columns and order by id, what do the id values look like (in general, and specifically in the rows that appear to be extra). What happens when you use one of those id values in a where statement (with/without an order by)?


If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
Highlighted

Re: Hive returns extra rows with "order by"

The id values in the extra rows are null even though the source data does not contain any rows with a null id. By all appearances it's returning rows of corrupt data that do not really exist in the table.

Highlighted

Re: Hive returns extra rows with "order by"

Expert Contributor

Based on your description, I believe you are encountering the problem that someone described here.

Your simple select works fine, but as soon as you need a mapreduce step (because of the order by in this case) the handling of newline characters fails.

Add this property to the hive-site via Ambari

hive.query.result.fileformat=SequenceFile

Now the query should work as expected since the query output is written in a binary format as opposed to a text format.


If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.

View solution in original post

Don't have an account?
Coming from Hortonworks? Activate your account here