Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How can select columns from two relations after join in pig.

avatar

Hi ,

am joining two relations(withoutschema) in pig and want to pick particular columns from both relations.

A = load 'data1' using PigStorage(',');($0,$1...$8)

B = load 'data2' using PigStoarge(',');($0..$4);

C = foreach(join A by($1,$2),B by($1,$2)) generate $0,$1,$4,$5,(how to select B relation columns)

1 ACCEPTED SOLUTION

avatar
Master Mentor

@jayaprakash gadi

here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.

grunt> fs -cat email_list.csv;
1,Christine,Romero,cromero0@eventbrite.com
2,Sara,Hansen,shansen1@tinypic.com
3,Albert,Rogers,arogers2@marriott.com
4,Kimberly,Morrison,kmorrison3@irs.gov
5,Eugene,Baker,ebaker4@cbslocal.com
6,Ann,Alexander,aalexander5@hhs.gov
7,Kathleen,Reed,kreed6@youtu.be
8,Todd,Scott,tscott7@deliciousdays.com
9,Sharon,Mccoy,smccoy8@nature.com
10,Evelyn,Rice,erice9@narod.ru

grunt> fs -cat gender_list.csv;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female

grunt> A = load 'email_list.csv' using PigStorage(',');
grunt> B = load 'gender_list.csv' using PigStorage(',');
grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2);
grunt> dump C;

(1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female)
(10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female)
(2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female)
(3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female)
(7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female)

grunt> D = foreach C generate $0, $1, $2, $3, $7;
grunt> dump D;

(1,Christine,Romero,cromero0@eventbrite.com,Female)
(10,Evelyn,Rice,erice9@narod.ru,Female)
(2,Sara,Hansen,shansen1@tinypic.com,Female)
(3,Albert,Rogers,arogers2@marriott.com,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,Female)
(7,Kathleen,Reed,kreed6@youtu.be,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,Female)

View solution in original post

5 REPLIES 5

avatar
Master Guru

@jayaprakash gadi please try this.. I haven't tested yet.

C = foreach(join A by($1,$2),B by($1,$2)) generate B.*

avatar

am able to pick particular columns from relations A & B where relations have schema but if relations doesn't have any schema then it's not working.

avatar
Master Guru

After join the elements from A are at positions $0 .. $8, the elements from B are at $9 .. $13. Also, observe Performance enhencers: Use types, and Project early and often which in your case means to remove un-needed elements before the join.

avatar

it worked ..thanks

avatar
Master Mentor

@jayaprakash gadi

here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.

grunt> fs -cat email_list.csv;
1,Christine,Romero,cromero0@eventbrite.com
2,Sara,Hansen,shansen1@tinypic.com
3,Albert,Rogers,arogers2@marriott.com
4,Kimberly,Morrison,kmorrison3@irs.gov
5,Eugene,Baker,ebaker4@cbslocal.com
6,Ann,Alexander,aalexander5@hhs.gov
7,Kathleen,Reed,kreed6@youtu.be
8,Todd,Scott,tscott7@deliciousdays.com
9,Sharon,Mccoy,smccoy8@nature.com
10,Evelyn,Rice,erice9@narod.ru

grunt> fs -cat gender_list.csv;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female

grunt> A = load 'email_list.csv' using PigStorage(',');
grunt> B = load 'gender_list.csv' using PigStorage(',');
grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2);
grunt> dump C;

(1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female)
(10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female)
(2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female)
(3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female)
(7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female)

grunt> D = foreach C generate $0, $1, $2, $3, $7;
grunt> dump D;

(1,Christine,Romero,cromero0@eventbrite.com,Female)
(10,Evelyn,Rice,erice9@narod.ru,Female)
(2,Sara,Hansen,shansen1@tinypic.com,Female)
(3,Albert,Rogers,arogers2@marriott.com,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,Female)
(7,Kathleen,Reed,kreed6@youtu.be,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,Female)