Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive - top n records within a group

Solved Go to solution
Highlighted

Hive - top n records within a group

Hi all,

I have a table with the fields user_id and value and I want to order the values in descending order within each user_id and then only emit the top 100 records for each user_id. This is the code I am attempting to use:

DROP TABLE IF EXISTS mytable2
CREATE TABLE mytable2 AS
SELECT * FROM 
 (SELECT *, rank (user_id) as rank
  FROM 
  (SELECT * from mytable
  DISTRIBUTE BY user_id
  SORT BY value DESC)a )b
WHERE rank<101
ORDER BY rank;

However when I run this query, I get the following error:

 Error while compiling statement: FAILED: SemanticException [Error 10247]: Missing over clause for function : rank [ERROR_STATUS]

Can anyone help?

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Hive - top n records within a group

Hello Maeve,

you did go a bit too deep. You can use the Rank clause in combination with an over statement that describes how Hive should rank the columns. Otherwise how would he know by what to rank the users. The partition by clause also tells hive to distribute by userid and to sort inside a userid without you needing to specify it specifically. Below is what you want right?

select * from (
	select user_id, value, desc, 
	rank() over ( partition by user_id order by value desc) as rank 
	from test4 ) t where rank < 3;

The output looks like this:

OK
1	2	hallo	1
1	1	hallo	2
2	11	hallo4	1
2	10	hallo3	2

The input table looks like this:

hive> select * from test4;
OK
USER_ID VALUE   DESC
1	1	hallo
1	2	hallo
2	3	hallo1
2	7	hallo1
2	10	hallo3
2	11	hallo4

View solution in original post

2 REPLIES 2
Highlighted

Re: Hive - top n records within a group

Hello Maeve,

you did go a bit too deep. You can use the Rank clause in combination with an over statement that describes how Hive should rank the columns. Otherwise how would he know by what to rank the users. The partition by clause also tells hive to distribute by userid and to sort inside a userid without you needing to specify it specifically. Below is what you want right?

select * from (
	select user_id, value, desc, 
	rank() over ( partition by user_id order by value desc) as rank 
	from test4 ) t where rank < 3;

The output looks like this:

OK
1	2	hallo	1
1	1	hallo	2
2	11	hallo4	1
2	10	hallo3	2

The input table looks like this:

hive> select * from test4;
OK
USER_ID VALUE   DESC
1	1	hallo
1	2	hallo
2	3	hallo1
2	7	hallo1
2	10	hallo3
2	11	hallo4

View solution in original post

Re: Hive - top n records within a group

Thanks a lot Benjamin - I did realise after posting the above that I needed a UDF to use with the rank function on its own. It's working now so thank you.

Don't have an account?
Coming from Hortonworks? Activate your account here