Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to FILTER in nested foreach

avatar
Expert Contributor

I have many files. one of which,say, header.csv, serves as a header file,i.e., it contains primary key(in database analogy) which servers as foreign key in the rest of the files.

Now, I want to do FOREACH and FILTER as follows:

A =LOAD 'header.csv' AS (Id:chararray,f1:chararrat,f2:chararray);

B = LOAD 'file1.csv' AS (Id:chararray,t1:chararray);

C = LOAD 'file2.csv' AS (Id:chararray)

..........

D = foreach A {

file1_filtered = FILTER file1 BY Id == A.Id;

file2_filtered = FILTER file2 BY Id == A.Id;

GENERATE file1_filtered,file2_filtered;

};

Finally I need to access the relations file1_filtered and file2_filtered. When I follow this approach I got the following error:

"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 2651, column 28> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

How can I achieve this in Pig?

1 ACCEPTED SOLUTION

avatar
Expert Contributor

We can achieve this using JOIN as follows.

1. JOIN A and B BY Id.

B_joined = JOIN A by Id, B by Id;

2. JOIN A and C by Id:

C_joined = JOIN A by Id, C by Id;

Now, we can get the required fields of A and C from their respective joined data sets as follows:

B_filtered = FOREACH B_joined GENERATE B::Id,B::t1;

C_filtered =FOREACH C_joined GENERATE C::Id;

View solution in original post

1 REPLY 1

avatar
Expert Contributor

We can achieve this using JOIN as follows.

1. JOIN A and B BY Id.

B_joined = JOIN A by Id, B by Id;

2. JOIN A and C by Id:

C_joined = JOIN A by Id, C by Id;

Now, we can get the required fields of A and C from their respective joined data sets as follows:

B_filtered = FOREACH B_joined GENERATE B::Id,B::t1;

C_filtered =FOREACH C_joined GENERATE C::Id;