Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Select nth row in hive

avatar
Expert Contributor

HI,

I have 5 records in my database and i want to select the 3rd records.Minus is not possible in hive.So i'm trying Left outer join to perform the query.It's giving me some random results.

Id,Codes

1 100 1 200 2 200 3 200 3 300

select a.id,b.id from analytical a inner join (select id from analytical order by id desc limit 2) b on a.id=b.id where b.id is null order by a.id asc limit 3

OUtput:

id,codes

1 NULL 1 NULL 2 NULL

ANy suggestions?

1 ACCEPTED SOLUTION

avatar
Super Guru

@Vaibhav Kumar

To re-iterate what @mqureshi already noticed, your query does not seem functionally correct. 1=1 is true, but null=null is false. Different story.

If you use LIMIT row_count with ORDER BY, Hive, like MySQL and many other SQL-like engines, ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result.

ROW_NUMBER function and select third row is what you need.

Your idea of inner join will not scale for many records. If you have duplicates then write your query to eliminate the duplicate or deal with that, but one would still wonder how would you determine the true third row when you have a duplicate? I don't see anything in the query you wrote dealing with that problem. I think there is still a lot of work to do on paper before even writing SQL.

View solution in original post

9 REPLIES 9

avatar
Super Guru
@Vaibhav Kumar

Two things here.

1. I don't understand your use of "a.id = b.id where b.id is null". When b.id is null, a.id and b.id will never be equal. However, it's your query and you probably know more about it, so you can ignore my comment if you know what I am talking about.

2. I think you need to use ROW_NUMBER function and then select the third row. This link describes usage of ROW_NUMBER() for SQL 2005 but it's the same for Hive.

avatar
Expert Contributor

1. i think An inner join between the same table twice with limit 1 will solve this issue .

2 . I cannot go with rownumber as you can see i have duplicate column values which will not give me correct results

@mqureshi

avatar
New Contributor

please use below query

with t as (

select row_number() over (order by Codes) as RowNum, ID,Codes

)

select * from t

where RowNum=3;

avatar
Expert Contributor

@prasoon kumar with Doesn't work in hive

avatar
New Contributor

Hi Vaibhav,

Which version of hive you are working with?

Please check below link

https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression

avatar
Super Guru

@Vaibhav Kumar

To re-iterate what @mqureshi already noticed, your query does not seem functionally correct. 1=1 is true, but null=null is false. Different story.

If you use LIMIT row_count with ORDER BY, Hive, like MySQL and many other SQL-like engines, ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result.

ROW_NUMBER function and select third row is what you need.

Your idea of inner join will not scale for many records. If you have duplicates then write your query to eliminate the duplicate or deal with that, but one would still wonder how would you determine the true third row when you have a duplicate? I don't see anything in the query you wrote dealing with that problem. I think there is still a lot of work to do on paper before even writing SQL.

avatar
Expert Contributor

@Constantin Stanca yes it will not work in case in duplicates but in case of distinct values the below worked:

select a.id,b.id from analytical a inner join (select id from analytical order by id desc limit 3) b on a.id=b.id ORDER BY A.ID ASC LIMIT 1

avatar
Super Guru

@Vaibhav Kumar

No question that this could work, just a bit concerned with performance. Order by is a very expensive operation involving a single reducer. Practically, you do one full table scan for that and once you get those three records you do another table scan to get the attributes of those records.

Look at an alternative here: https://community.hortonworks.com/questions/24667/hive-top-n-records-within-a-group.html

If any response in this thread was helpful, don't forget to vote/accept the best response.

avatar
Expert Contributor

I'm not sure about my data because here even composite keys can produce duplicates. so going by analytical function is not a good choice for me.Any how query is not taking that much time for me.

I voted up for your Solution .Thanks For your Response. 🙂 @Constantin Stanca