Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Handling multiple records in hive

SOLVED Go to solution
Highlighted

Handling multiple records in hive

New Contributor

Hi All,

I have a scenario where I have to do sum(column x) for month range and sum (column x) for a range of year and insert both the records into the next table, listing with example

example: sale_id,saleValue,date

1,1000,2015/12/14

2,2000,2015/11/01

3,3000,2015/12/01

4,4000,2015/01/01

Here when we try for this month the output is sum(salevalue) for id's 1 and 3, and for last year it's sum(salevalue) for id's 1,2,3 and 4

And I have to insert both the values in to next table

Appreciate your help

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Handling multiple records in hive

New Contributor

You should look to GROUP BY WITH CUBE and ROLLUP to get multiple levels of aggregation in a single pass through the data. Take a look at https://cwiki.apache.org/confluence/display/Hive/E... for the documentation.

Here's a query that will get you what you want. You can filter the results from there.

SELECT
  SUM(saleValue), yr, yrmo
FROM (
  SELECT sale_id, saleValue, SUBSTR(`date`,1,4) AS yr, SUBSTR(`date`,1,7) AS yrmo
  FROM detaildata
) a 
GROUP BY yr, yrmo WITH ROLLUP

The result you get looks like this:

totalSaleValue  yr    yrmo
10000
10000           2015
4000            2015  2015/01
2000            2015  2015/11
4000            2015  2015/12

You can get at the yearly totals by filtering "WHERE yrmo IS NULL" or the December total with "WHERE yrmo = '2015/12'"

4 REPLIES 4

Re: Handling multiple records in hive

@Ranjith M

Try two options below. It will read your source table just once and will create 2 lines:

create external table test_groupby 
(sale_id string, salevalue double, datex string) 
row format delimited fields terminated by ',' 
stored as textfile 
location '/tmp/groupby'
;

select * from test_groupby;

select inline(
  array(
    named_struct('type', 'month', 'value', sum(case when substr(datex, 1,7) = '2015/12' then salevalue end)),
    named_struct('type', 'year', 'value', sum(case when substr(datex, 1,4) = '2015' then salevalue end))
  ))
from test_groupby
;
create external table test_groupby 
(sale_id string, salevalue double, datex string) 
row format delimited fields terminated by ',' 
stored as textfile 
location '/tmp/groupby'
;

select * from test_groupby;

drop table test_groupby_result;
create table test_groupby_result
(value double)
partitioned by (type string)
;

from test_groupby
insert into table test_groupby_result partition (type = 'month')
select sum(salevalue)
where substr(datex, 1,7) = '2015/12'

insert into table test_groupby_result partition (type = 'year')
select sum(salevalue)
where substr(datex, 1,4) = '2015'
;

select * from test_groupby_result;

Re: Handling multiple records in hive

New Contributor

Thank you very much @Guilherme Braccialli I will try to implement this logic

Will let you know updates

Re: Handling multiple records in hive

New Contributor

You should look to GROUP BY WITH CUBE and ROLLUP to get multiple levels of aggregation in a single pass through the data. Take a look at https://cwiki.apache.org/confluence/display/Hive/E... for the documentation.

Here's a query that will get you what you want. You can filter the results from there.

SELECT
  SUM(saleValue), yr, yrmo
FROM (
  SELECT sale_id, saleValue, SUBSTR(`date`,1,4) AS yr, SUBSTR(`date`,1,7) AS yrmo
  FROM detaildata
) a 
GROUP BY yr, yrmo WITH ROLLUP

The result you get looks like this:

totalSaleValue  yr    yrmo
10000
10000           2015
4000            2015  2015/01
2000            2015  2015/11
4000            2015  2015/12

You can get at the yearly totals by filtering "WHERE yrmo IS NULL" or the December total with "WHERE yrmo = '2015/12'"

Re: Handling multiple records in hive

Awesome!