Created on 10-17-201607:02 PM - edited 08-17-201908:46 AM
Although it
is a simple solution, there is not much reference to this exact problem in the
www.
We have seen
people doing multiple projection, flattening and group by to get the data into
shape for storing into the final table using pig after the aggregation and
group operations.
Source
Table:
CREATE EXTERNAL TABLE IF NOT EXISTS source(
pk1 string,
pk2 string,
agg1 INT,
agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data:
Target
Table:
CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
pk1 string,
pk2 string,
sum_agg1 BIGINT,
sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:
--Load the data in pig relations
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
--Group the data
group_staging = group staging BY (pk1,pk2);
--Flatten the grouped data and generate aggregates with same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS(pk1,pk2),SUM(staging.agg1) as sum_agg1, SUM(staging.agg2) as sum_agg2;
--Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1,pk2;
--Store the data using HCatStorer
--Data will be automatically dereferenced by using the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();