Community Articles

Find and share helpful community-sourced technical articles.
avatar

Although it is a simple solution, there is not much reference to this exact problem in the www.

We have seen people doing multiple projection, flattening and group by to get the data into shape for storing into the final table using pig after the aggregation and group operations.

Source Table:

CREATE EXTERNAL TABLE IF NOT EXISTS source(
pk1  string,
pk2  string,
agg1 INT,
agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");

Data:

8625-data.jpg

Target Table:

CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
pk1 string,
pk2 string,
sum_agg1 BIGINT,
sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");

Pig Script:

--Load the data in pig relations
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
--Group the data
group_staging = group staging BY (pk1,pk2);
--Flatten the grouped data and generate aggregates with same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS(pk1,pk2),SUM(staging.agg1) as sum_agg1, SUM(staging.agg2) as sum_agg2;
--Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1,pk2;
--Store the data using HCatStorer
--Data will be automatically dereferenced by using the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();

Final Results:

8628-fr.png

Hope this helps.

Thanks.

1,689 Views