- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How to FILTER in nested foreach
- Labels:
-
Apache Pig
Created ‎03-16-2017 12:34 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have many files. one of which,say, header.csv, serves as a header file,i.e., it contains primary key(in database analogy) which servers as foreign key in the rest of the files.
Now, I want to do FOREACH and FILTER as follows:
A =LOAD 'header.csv' AS (Id:chararray,f1:chararrat,f2:chararray);
B = LOAD 'file1.csv' AS (Id:chararray,t1:chararray);
C = LOAD 'file2.csv' AS (Id:chararray)
..........
D = foreach A {
file1_filtered = FILTER file1 BY Id == A.Id;
file2_filtered = FILTER file2 BY Id == A.Id;
GENERATE file1_filtered,file2_filtered;
};
Finally I need to access the relations file1_filtered and file2_filtered. When I follow this approach I got the following error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 2651, column 28> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
How can I achieve this in Pig?
Created ‎03-17-2017 06:17 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We can achieve this using JOIN as follows.
1. JOIN A and B BY Id.
B_joined = JOIN A by Id, B by Id;
2. JOIN A and C by Id:
C_joined = JOIN A by Id, C by Id;
Now, we can get the required fields of A and C from their respective joined data sets as follows:
B_filtered = FOREACH B_joined GENERATE B::Id,B::t1;
C_filtered =FOREACH C_joined GENERATE C::Id;
Created ‎03-17-2017 06:17 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We can achieve this using JOIN as follows.
1. JOIN A and B BY Id.
B_joined = JOIN A by Id, B by Id;
2. JOIN A and C by Id:
C_joined = JOIN A by Id, C by Id;
Now, we can get the required fields of A and C from their respective joined data sets as follows:
B_filtered = FOREACH B_joined GENERATE B::Id,B::t1;
C_filtered =FOREACH C_joined GENERATE C::Id;
