Why should we group using Apache PIG

Hi guys, I'm very new to Apache Pig, and I already see a lot of scripts using the GROUP statement without any aggregate function (like SUM(X) after a GROUP BY). Why is it a good alternative to use the GROUP statement on its own? Thanks!

1 ACCEPTED SOLUTION

Super Collaborator

GROUP collects the records that share the same key into a single group. It is not mandatory to perform an aggregation along with it.

For a better understanding, consider a file with ID, Name and Age columns, as below:

1,John,23
2,James,24
3,Alice,30
4,Bob,23
5,Bill,24
If we apply the script below to the file, loading it and grouping it by age, all the records associated with one age are collected into a single group.
details = LOAD 'file' USING PigStorage(',') as (id:int, name:chararray, age:int);
grouped_data = GROUP details by age;
dump grouped_data;

The output is:

(23,{(1,John,23),(4,Bob,23)})
(24,{(2,James,24),(5,Bill,24)})
(30,{(3,Alice,30)})

Furthermore, if you describe the schema of the grouped data, you see the following:

describe grouped_data;
grouped_data: {group: int,details: {(id: int,name: chararray,age: int)}}
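
To illustrate how this schema is used, here is a minimal sketch that continues from the same script. It assumes the same 'file' and field names as above; the alias age_summary and the output field names num_people and names are made up for the example. The key is available as group and the bag of matching records as details, so an aggregation can be applied later in a FOREACH, or skipped entirely.

```pig
-- Same load and group as in the answer above.
details = LOAD 'file' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grouped_data = GROUP details BY age;

-- Each grouped record has two fields:
--   group   : the key (here, the age)
--   details : a bag holding every original record with that key
-- COUNT runs over the bag; details.name projects the bag down to the names only.
-- (age_summary, num_people and names are illustrative names, not from the original post.)
age_summary = FOREACH grouped_data GENERATE
    group AS age,
    COUNT(details) AS num_people,
    details.name AS names;

DUMP age_summary;
```

Keeping the GROUP separate from the aggregation like this means the same grouped relation can feed several different FOREACH statements.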



4 REPLIES

Super Collaborator

In addition, you can group by multiple columns, or even by ALL, which puts every record into one single group.
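
A rough sketch of both variants follows. It assumes a hypothetical input 'file_with_city' that extends the sample data with a fourth city column; that file, the city field, and all alias names here are made up for illustration and are not part of the original question.

```pig
-- Hypothetical input: the sample file plus a 'city' column (assumption for this example).
people = LOAD 'file_with_city' USING PigStorage(',')
         AS (id:int, name:chararray, age:int, city:chararray);

-- Grouping by multiple columns: the key field 'group' becomes a tuple (age, city).
by_age_city = GROUP people BY (age, city);
counts = FOREACH by_age_city GENERATE
    FLATTEN(group) AS (age, city),   -- unpack the key tuple into two fields
    COUNT(people) AS num_people;

-- Grouping by ALL: every record lands in one group whose key is the string 'all'.
everyone = GROUP people ALL;
totals = FOREACH everyone GENERATE
    COUNT(people) AS total_rows,
    AVG(people.age) AS avg_age;

DUMP counts;
DUMP totals;
```

GROUP ... ALL is the usual way to compute a statistic over the whole relation, since Pig aggregate functions such as COUNT and AVG operate on bags.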

Just top Arun A K 🙂 Many thanks!

Super Collaborator

You are welcome @Pedro Rodgers
