Options
- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
Guru
Created on 10-17-2016 07:02 PM - edited 08-17-2019 08:46 AM
Although it is a simple solution, there is not much reference to this exact problem in the www.
We have seen people doing multiple projection, flattening and group by to get the data into shape for storing into the final table using pig after the aggregation and group operations.
Source Table:
CREATE EXTERNAL TABLE IF NOT EXISTS source( pk1 string, pk2 string, agg1 INT, agg2 INT ) STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data:
Target Table:
CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated( pk1 string, pk2 string, sum_agg1 BIGINT, sum_agg2 BIGINT ) STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:
--Load the data in pig relations staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader(); --Group the data group_staging = group staging BY (pk1,pk2); --Flatten the grouped data and generate aggregates with same attribute names as the target table calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS(pk1,pk2),SUM(staging.agg1) as sum_agg1, SUM(staging.agg2) as sum_agg2; --Order the data if required calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1,pk2; --Store the data using HCatStorer --Data will be automatically dereferenced by using the HCatalog metastore STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();
Final Results:
Hope this helps.
Thanks.
1,726 Views