Created 08-02-2016 08:04 PM
Hi ,
am joining two relations(withoutschema) in pig and want to pick particular columns from both relations.
A = load 'data1' using PigStorage(',');($0,$1...$8)
B = load 'data2' using PigStoarge(',');($0..$4);
C = foreach(join A by($1,$2),B by($1,$2)) generate $0,$1,$4,$5,(how to select B relation columns)
Created 08-03-2016 05:55 PM
here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.
grunt> fs -cat email_list.csv; 1,Christine,Romero,cromero0@eventbrite.com 2,Sara,Hansen,shansen1@tinypic.com 3,Albert,Rogers,arogers2@marriott.com 4,Kimberly,Morrison,kmorrison3@irs.gov 5,Eugene,Baker,ebaker4@cbslocal.com 6,Ann,Alexander,aalexander5@hhs.gov 7,Kathleen,Reed,kreed6@youtu.be 8,Todd,Scott,tscott7@deliciousdays.com 9,Sharon,Mccoy,smccoy8@nature.com 10,Evelyn,Rice,erice9@narod.ru grunt> fs -cat gender_list.csv; 1,Christine,Romero,Female 2,Sara,Hansen,Female 3,Albert,Rogers,Male 4,Kimberly,Morrison,Female 5,Eugene,Baker,Male 6,Ann,Alexander,Female 7,Kathleen,Reed,Female 8,Todd,Scott,Male 9,Sharon,Mccoy,Female 10,Evelyn,Rice,Female grunt> A = load 'email_list.csv' using PigStorage(','); grunt> B = load 'gender_list.csv' using PigStorage(','); grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2); grunt> dump C; (1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female) (10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female) (2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female) (3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male) (4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female) (5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male) (6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female) (7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female) (8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male) (9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female) grunt> D = foreach C generate $0, $1, $2, $3, $7; grunt> dump D; (1,Christine,Romero,cromero0@eventbrite.com,Female) (10,Evelyn,Rice,erice9@narod.ru,Female) (2,Sara,Hansen,shansen1@tinypic.com,Female) (3,Albert,Rogers,arogers2@marriott.com,Male) (4,Kimberly,Morrison,kmorrison3@irs.gov,Female) (5,Eugene,Baker,ebaker4@cbslocal.com,Male) (6,Ann,Alexander,aalexander5@hhs.gov,Female) (7,Kathleen,Reed,kreed6@youtu.be,Female) (8,Todd,Scott,tscott7@deliciousdays.com,Male) (9,Sharon,Mccoy,smccoy8@nature.com,Female)
Created 08-02-2016 08:33 PM
@jayaprakash gadi please try this.. I haven't tested yet.
C = foreach(join A by($1,$2),B by($1,$2)) generate B.*
Created 08-03-2016 12:03 AM
am able to pick particular columns from relations A & B where relations have schema but if relations doesn't have any schema then it's not working.
Created 08-03-2016 02:22 AM
After join the elements from A are at positions $0 .. $8, the elements from B are at $9 .. $13. Also, observe Performance enhencers: Use types, and Project early and often which in your case means to remove un-needed elements before the join.
Created 08-04-2016 06:39 PM
it worked ..thanks
Created 08-03-2016 05:55 PM
here's my solution, considering that result of join is sum of all fields, then if you have 10 columns in A and 4 columns in B, your result row will 14, you can cherry pick columns 1, 2,3 from A and 11, 12, 13 from B.
grunt> fs -cat email_list.csv; 1,Christine,Romero,cromero0@eventbrite.com 2,Sara,Hansen,shansen1@tinypic.com 3,Albert,Rogers,arogers2@marriott.com 4,Kimberly,Morrison,kmorrison3@irs.gov 5,Eugene,Baker,ebaker4@cbslocal.com 6,Ann,Alexander,aalexander5@hhs.gov 7,Kathleen,Reed,kreed6@youtu.be 8,Todd,Scott,tscott7@deliciousdays.com 9,Sharon,Mccoy,smccoy8@nature.com 10,Evelyn,Rice,erice9@narod.ru grunt> fs -cat gender_list.csv; 1,Christine,Romero,Female 2,Sara,Hansen,Female 3,Albert,Rogers,Male 4,Kimberly,Morrison,Female 5,Eugene,Baker,Male 6,Ann,Alexander,Female 7,Kathleen,Reed,Female 8,Todd,Scott,Male 9,Sharon,Mccoy,Female 10,Evelyn,Rice,Female grunt> A = load 'email_list.csv' using PigStorage(','); grunt> B = load 'gender_list.csv' using PigStorage(','); grunt> C = join A by ($0, $1, $2), B by ($0, $1, $2); grunt> dump C; (1,Christine,Romero,cromero0@eventbrite.com,1,Christine,Romero,Female) (10,Evelyn,Rice,erice9@narod.ru,10,Evelyn,Rice,Female) (2,Sara,Hansen,shansen1@tinypic.com,2,Sara,Hansen,Female) (3,Albert,Rogers,arogers2@marriott.com,3,Albert,Rogers,Male) (4,Kimberly,Morrison,kmorrison3@irs.gov,4,Kimberly,Morrison,Female) (5,Eugene,Baker,ebaker4@cbslocal.com,5,Eugene,Baker,Male) (6,Ann,Alexander,aalexander5@hhs.gov,6,Ann,Alexander,Female) (7,Kathleen,Reed,kreed6@youtu.be,7,Kathleen,Reed,Female) (8,Todd,Scott,tscott7@deliciousdays.com,8,Todd,Scott,Male) (9,Sharon,Mccoy,smccoy8@nature.com,9,Sharon,Mccoy,Female) grunt> D = foreach C generate $0, $1, $2, $3, $7; grunt> dump D; (1,Christine,Romero,cromero0@eventbrite.com,Female) (10,Evelyn,Rice,erice9@narod.ru,Female) (2,Sara,Hansen,shansen1@tinypic.com,Female) (3,Albert,Rogers,arogers2@marriott.com,Male) (4,Kimberly,Morrison,kmorrison3@irs.gov,Female) (5,Eugene,Baker,ebaker4@cbslocal.com,Male) (6,Ann,Alexander,aalexander5@hhs.gov,Female) (7,Kathleen,Reed,kreed6@youtu.be,Female) (8,Todd,Scott,tscott7@deliciousdays.com,Male) (9,Sharon,Mccoy,smccoy8@nature.com,Female)