Support Questions

Find answers, ask questions, and share your expertise
Announcements
Welcome to the upgraded Community! Read this blog to see What’s New!

Handling multiple records in hive

avatar
New Contributor

Hi All,

I have a scenario where I have to do sum(column x) for month range and sum (column x) for a range of year and insert both the records into the next table, listing with example

example: sale_id,saleValue,date

1,1000,2015/12/14

2,2000,2015/11/01

3,3000,2015/12/01

4,4000,2015/01/01

Here when we try for this month the output is sum(salevalue) for id's 1 and 3, and for last year it's sum(salevalue) for id's 1,2,3 and 4

And I have to insert both the values in to next table

Appreciate your help

1 ACCEPTED SOLUTION

avatar
Explorer

You should look to GROUP BY WITH CUBE and ROLLUP to get multiple levels of aggregation in a single pass through the data. Take a look at https://cwiki.apache.org/confluence/display/Hive/E... for the documentation.

Here's a query that will get you what you want. You can filter the results from there.

SELECT
  SUM(saleValue), yr, yrmo
FROM (
  SELECT sale_id, saleValue, SUBSTR(`date`,1,4) AS yr, SUBSTR(`date`,1,7) AS yrmo
  FROM detaildata
) a 
GROUP BY yr, yrmo WITH ROLLUP

The result you get looks like this:

totalSaleValue  yr    yrmo
10000
10000           2015
4000            2015  2015/01
2000            2015  2015/11
4000            2015  2015/12

You can get at the yearly totals by filtering "WHERE yrmo IS NULL" or the December total with "WHERE yrmo = '2015/12'"

View solution in original post

4 REPLIES 4

avatar

@Ranjith M

Try two options below. It will read your source table just once and will create 2 lines:

create external table test_groupby 
(sale_id string, salevalue double, datex string) 
row format delimited fields terminated by ',' 
stored as textfile 
location '/tmp/groupby'
;

select * from test_groupby;

select inline(
  array(
    named_struct('type', 'month', 'value', sum(case when substr(datex, 1,7) = '2015/12' then salevalue end)),
    named_struct('type', 'year', 'value', sum(case when substr(datex, 1,4) = '2015' then salevalue end))
  ))
from test_groupby
;
create external table test_groupby 
(sale_id string, salevalue double, datex string) 
row format delimited fields terminated by ',' 
stored as textfile 
location '/tmp/groupby'
;

select * from test_groupby;

drop table test_groupby_result;
create table test_groupby_result
(value double)
partitioned by (type string)
;

from test_groupby
insert into table test_groupby_result partition (type = 'month')
select sum(salevalue)
where substr(datex, 1,7) = '2015/12'

insert into table test_groupby_result partition (type = 'year')
select sum(salevalue)
where substr(datex, 1,4) = '2015'
;

select * from test_groupby_result;

avatar
New Contributor

Thank you very much @Guilherme Braccialli I will try to implement this logic

Will let you know updates

avatar
Explorer

You should look to GROUP BY WITH CUBE and ROLLUP to get multiple levels of aggregation in a single pass through the data. Take a look at https://cwiki.apache.org/confluence/display/Hive/E... for the documentation.

Here's a query that will get you what you want. You can filter the results from there.

SELECT
  SUM(saleValue), yr, yrmo
FROM (
  SELECT sale_id, saleValue, SUBSTR(`date`,1,4) AS yr, SUBSTR(`date`,1,7) AS yrmo
  FROM detaildata
) a 
GROUP BY yr, yrmo WITH ROLLUP

The result you get looks like this:

totalSaleValue  yr    yrmo
10000
10000           2015
4000            2015  2015/01
2000            2015  2015/11
4000            2015  2015/12

You can get at the yearly totals by filtering "WHERE yrmo IS NULL" or the December total with "WHERE yrmo = '2015/12'"

avatar

Awesome!

Labels