Support Questions

Find answers, ask questions, and share your expertise

How can select columns from two relations after join in pig.

avatar

Hi ,

am joining two relations(withoutschema) in pig and want to pick particular columns from both relations.

A = load 'data1' using PigStorage(',');($0,$1...$8)

B = load 'data2' using PigStoarge(',');($0..$4);

C = foreach(join A by($1,$2),B by($1,$2)) generate $0,$1,$4,$5,(how to select B relation columns)

1 ACCEPTED SOLUTION

avatar
Master Mentor

@jayaprakash gadi

here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.

grunt> fs -cat email_list.csv;
1,Christine,Romero,cromero0@eventbrite.com
2,Sara,Hansen,shansen1@tinypic.com
3,Albert,Rogers,arogers2@marriott.com
4,Kimberly,Morrison,kmorrison3@irs.gov
5,Eugene,Baker,ebaker4@cbslocal.com
6,Ann,Alexander,aalexander5@hhs.gov
7,Kathleen,Reed,kreed6@youtu.be
8,Todd,Scott,tscott7@deliciousdays.com
9,Sharon,Mccoy,smccoy8@nature.com
10,Evelyn,Rice,erice9@narod.ru

grunt> fs -cat gender_list.csv;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female

grunt> A = load 'email_list.csv' using PigStorage(',');
grunt> B = load 'gender_list.csv' using PigStorage(',');
grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2);
grunt> dump C;

(1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female)
(10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female)
(2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female)
(3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female)
(7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female)

grunt> D = foreach C generate $0, $1, $2, $3, $7;
grunt> dump D;

(1,Christine,Romero,cromero0@eventbrite.com,Female)
(10,Evelyn,Rice,erice9@narod.ru,Female)
(2,Sara,Hansen,shansen1@tinypic.com,Female)
(3,Albert,Rogers,arogers2@marriott.com,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,Female)
(7,Kathleen,Reed,kreed6@youtu.be,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,Female)

View solution in original post

5 REPLIES 5

avatar
Master Guru

@jayaprakash gadi please try this.. I haven't tested yet.

C = foreach(join A by($1,$2),B by($1,$2)) generate B.*

avatar

am able to pick particular columns from relations A & B where relations have schema but if relations doesn't have any schema then it's not working.

avatar
Master Guru

After join the elements from A are at positions $0 .. $8, the elements from B are at $9 .. $13. Also, observe Performance enhencers: Use types, and Project early and often which in your case means to remove un-needed elements before the join.

avatar

it worked ..thanks

avatar
Master Mentor

@jayaprakash gadi

here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.

grunt> fs -cat email_list.csv;
1,Christine,Romero,cromero0@eventbrite.com
2,Sara,Hansen,shansen1@tinypic.com
3,Albert,Rogers,arogers2@marriott.com
4,Kimberly,Morrison,kmorrison3@irs.gov
5,Eugene,Baker,ebaker4@cbslocal.com
6,Ann,Alexander,aalexander5@hhs.gov
7,Kathleen,Reed,kreed6@youtu.be
8,Todd,Scott,tscott7@deliciousdays.com
9,Sharon,Mccoy,smccoy8@nature.com
10,Evelyn,Rice,erice9@narod.ru

grunt> fs -cat gender_list.csv;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female

grunt> A = load 'email_list.csv' using PigStorage(',');
grunt> B = load 'gender_list.csv' using PigStorage(',');
grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2);
grunt> dump C;

(1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female)
(10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female)
(2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female)
(3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female)
(7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female)

grunt> D = foreach C generate $0, $1, $2, $3, $7;
grunt> dump D;

(1,Christine,Romero,cromero0@eventbrite.com,Female)
(10,Evelyn,Rice,erice9@narod.ru,Female)
(2,Sara,Hansen,shansen1@tinypic.com,Female)
(3,Albert,Rogers,arogers2@marriott.com,Male)
(4,Kimberly,Morrison,kmorrison3@irs.gov,Female)
(5,Eugene,Baker,ebaker4@cbslocal.com,Male)
(6,Ann,Alexander,aalexander5@hhs.gov,Female)
(7,Kathleen,Reed,kreed6@youtu.be,Female)
(8,Todd,Scott,tscott7@deliciousdays.com,Male)
(9,Sharon,Mccoy,smccoy8@nature.com,Female)