Created 08-10-2016 02:13 PM
Hi guys, I'm very new to Apache Pig, and I already see a lot of scripts using the GROUP statement without any aggregate function (like SUM(X), A GROUP BY A). Why is it a good alternative to use the GROUP statement on its own? Thanks!
Created 08-10-2016 06:28 PM
GROUP collects all the records that have the same key into one bag. You do not have to perform an aggregation along with it.
For a better understanding, consider a file with the columns ID, Name and Age, as below:
1,John,23
2,James,24
3,Alice,30
4,Bob,23
5,Bill,24
If we apply the script below, which loads the file and groups it by age, all the records that share an age end up in one single group.
details = LOAD 'file' USING PigStorage(',') as (id:int, name:chararray, age:int);
grouped_data = GROUP details by age;
dump grouped_data;
The output is:
(23,{(1,John,23),(4,Bob,23)})
(24,{(2,James,24),(5,Bill,24)})
(30,{(3,Alice,30)})
Furthermore, if you describe the schema of the grouped data, you will see the following:
describe grouped_data;
grouped_data: {group: int,details: {(id: int,name: chararray,age: int)}}
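Since the grouped relation keeps every original record inside the `details` bag, a later step can still aggregate or project from it. A minimal sketch, assuming the same file and schema as above; the alias `summary` and the field names `people` and `names` are only illustrative:

```pig
-- Same load and group as in the script above
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grouped_data = GROUP details BY age;

-- One row per age: the key, how many records share it, and a bag of their names
summary = FOREACH grouped_data GENERATE
    group AS age,
    COUNT(details) AS people,
    details.name AS names;

DUMP summary;
```

This is one reason a plain GROUP is common in scripts: the grouping is done once, and the aggregation (if any) is chosen afterwards in a FOREACH.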
You can explore more here
Created 08-10-2016 06:30 PM
++ You can also group by multiple columns, or group all the records together with ALL.
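A short sketch of both variants, assuming the same file and schema as in the answer above; the aliases `by_age_name`, `all_rows` and `total` are made up for illustration:

```pig
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);

-- Group by several columns: the key is listed in parentheses,
-- and the resulting 'group' field is a tuple of (age, name)
by_age_name = GROUP details BY (age, name);
DESCRIBE by_age_name;

-- Group everything into a single group, e.g. to count all the records
all_rows = GROUP details ALL;
total = FOREACH all_rows GENERATE COUNT(details) AS n;
DUMP total;
```

With the five-line sample file, `total` should contain the single tuple (5).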
Created 08-10-2016 10:15 PM
Excellent, Arun A K 🙂 Many thanks!
Created 08-10-2016 11:39 PM
You are welcome @Pedro Rodgers