Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Top 5 Distinct Records using Hive

Highlighted

Top 5 Distinct Records using Hive

New Contributor

I am trying to pull the top five states with the highest measurements related to a specific measureid. My issue is that I am trying to pull DISTINCT states with the highest measurements for each year from 2008 - 2013, i.e., highest state for 2008, 2009, 2010, etc.

My query:

select distinct statename, measureid, reportyear, value  from air_quality  where measureid = 87  and reportyear >= 2008  and reportyear <= 2013  sort by value desc limit 5 

I expect the output with DISTINCT statenames, i.e., I don't want one repeated. If California has the highest for one year, it will not be repeated again.

Currently it displays as "... California, 22 ; ... California, 22 ; ... California, 19 ; ... Arizona, 18 ; ... California, 18"


What are some thoughts?

3 REPLIES 3

Re: Top 5 Distinct Records using Hive

New Contributor

you can try it.

select * from (

select *,row_number() over(partition by statename, measureid, reportyear, value order by value desc ) as num

from air_quality ) as t where t.num=1 and measureid = 87 and reportyear >= 2008 and reportyear <= 2013

Is that right?

Re: Top 5 Distinct Records using Hive

New Contributor

@bin liu This almost works. It displays every row in the table when I only need the top five distinct states.

Re: Top 5 Distinct Records using Hive

Expert Contributor
with t as (select *,row_number() over(partition by statename order by value desc ) as num from air_quality where measureid = 87 and reportyear >= 2008 and reportyear <= 2013)
select statename, measureid, reportyear, value from t where num=1 order by value desc limit 5