Why should we group using Apache Pig?

Rising Star

Hi guys, I'm very new to Apache Pig, and I already see a lot of scripts that use the GROUP statement without any aggregate function (unlike SQL, where you would write something like SUM(X), A ... GROUP BY A). Why is it useful to use the GROUP statement this way? Thanks!

1 ACCEPTED SOLUTION

Super Collaborator

GROUP collects all records that share the same key. It does not have to be accompanied by an aggregation.

For a better understanding, consider a file with ID, Name and Age columns, as below:

1,John,23
2,James,24
3,Alice,30
4,Bob,23
5,Bill,24

The script below loads the file and groups it by age, so all the records for a given age end up in a single group:

-- Load the comma-separated file with an explicit schema
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);
-- Collect all records that share the same age
grouped_data = GROUP details BY age;
DUMP grouped_data;

The output is:

(23,{(1,John,23),(4,Bob,23)})
(24,{(2,James,24),(5,Bill,24)})
(30,{(3,Alice,30)})

Furthermore, if you describe the schema of the grouped data, you see the following:

describe grouped_data;
grouped_data: {group: int,details: {(id: int,name: chararray,age: int)}}
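
As a rough sketch of what you can do with this schema, the two fields can be used in a later FOREACH: `group` holds the key and `details` is the bag of matching records. The relation names `names_by_age` and `age_counts` and the field name `num_people` are only illustrative, and the script assumes the same sample file as above.

```pig
-- Assumes the same 'file' and schema as in the script above.
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grouped_data = GROUP details BY age;

-- No aggregation: keep the key and project just the names out of each bag.
names_by_age = FOREACH grouped_data GENERATE group AS age, details.name;
DUMP names_by_age;

-- With aggregation: COUNT is a built-in that counts the tuples in each bag.
age_counts = FOREACH grouped_data GENERATE group AS age, COUNT(details) AS num_people;
DUMP age_counts;
```

For the sample file, the second relation should give a count of 2 for ages 23 and 24, and 1 for age 30. So GROUP on its own just organizes the data, and whether you aggregate afterwards is a separate choice.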

You can explore more here

REPLIES

Super Collaborator

++ You can also group by multiple columns, or put every record into a single group with GROUP ... ALL.
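
A minimal sketch of both variants, assuming the same sample file and schema as in the accepted answer (the relation names `by_age_name`, `everything` and `total` are made up for illustration):

```pig
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);

-- Group by more than one column: the 'group' key becomes a tuple (age, name).
by_age_name = GROUP details BY (age, name);
DESCRIBE by_age_name;

-- Group every record into one single group; the 'group' key is the string 'all'.
everything = GROUP details ALL;

-- Typical use of GROUP ALL: a count over the whole relation.
total = FOREACH everything GENERATE COUNT(details) AS num_records;
DUMP total;
```

With the five sample rows, `total` should contain a single tuple with the value 5.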

Rising Star

Just great, Arun A K 🙂 Many thanks!

Super Collaborator

You are welcome @Pedro Rodgers