Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive - top n records within a group

avatar
Contributor

Hi all,

I have a table with the fields user_id and value and I want to order the values in descending order within each user_id and then only emit the top 100 records for each user_id. This is the code I am attempting to use:

DROP TABLE IF EXISTS mytable2
CREATE TABLE mytable2 AS
SELECT * FROM 
 (SELECT *, rank (user_id) as rank
  FROM 
  (SELECT * from mytable
  DISTRIBUTE BY user_id
  SORT BY value DESC)a )b
WHERE rank<101
ORDER BY rank;

However when I run this query, I get the following error:

 Error while compiling statement: FAILED: SemanticException [Error 10247]: Missing over clause for function : rank [ERROR_STATUS]

Can anyone help?

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Master Guru

Hello Maeve,

you did go a bit too deep. You can use the Rank clause in combination with an over statement that describes how Hive should rank the columns. Otherwise how would he know by what to rank the users. The partition by clause also tells hive to distribute by userid and to sort inside a userid without you needing to specify it specifically. Below is what you want right?

select * from (
	select user_id, value, desc, 
	rank() over ( partition by user_id order by value desc) as rank 
	from test4 ) t where rank < 3;

The output looks like this:

OK
1	2	hallo	1
1	1	hallo	2
2	11	hallo4	1
2	10	hallo3	2

The input table looks like this:

hive> select * from test4;
OK
USER_ID VALUE   DESC
1	1	hallo
1	2	hallo
2	3	hallo1
2	7	hallo1
2	10	hallo3
2	11	hallo4

View solution in original post

2 REPLIES 2

avatar
Master Guru

Hello Maeve,

you did go a bit too deep. You can use the Rank clause in combination with an over statement that describes how Hive should rank the columns. Otherwise how would he know by what to rank the users. The partition by clause also tells hive to distribute by userid and to sort inside a userid without you needing to specify it specifically. Below is what you want right?

select * from (
	select user_id, value, desc, 
	rank() over ( partition by user_id order by value desc) as rank 
	from test4 ) t where rank < 3;

The output looks like this:

OK
1	2	hallo	1
1	1	hallo	2
2	11	hallo4	1
2	10	hallo3	2

The input table looks like this:

hive> select * from test4;
OK
USER_ID VALUE   DESC
1	1	hallo
1	2	hallo
2	3	hallo1
2	7	hallo1
2	10	hallo3
2	11	hallo4

avatar
Contributor

Thanks a lot Benjamin - I did realise after posting the above that I needed a UDF to use with the rank function on its own. It's working now so thank you.