Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Count values that are filtered - Apache PIG

Rising Star
Given this statement:
  Values = FILTER Input_Data BY Fields > 0;
How can I count the number of records that passed the filter and the number that did not? Many thanks!
1 ACCEPTED SOLUTION

Guru

This should work

-- split into 2 datasets
SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;

-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);

-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);
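
To actually see the two numbers, the snippet above needs a loaded relation and an output step. A minimal end-to-end sketch, assuming a hypothetical tab-separated file `input.tsv` with a single integer column named `Fields` (the file name, schema, and the `kept`/`dropped` field names are illustrative and not from the original post):

```pig
-- Hypothetical input: 'input.tsv' and its one-column schema are assumptions.
Input_Data = LOAD 'input.tsv' USING PigStorage('\t') AS (Fields:int);

-- Records passing the original filter go to A, the rest to B.
SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;

-- Count the records that passed the filter.
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A) AS kept;

-- Count the records that did not pass.
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B) AS dropped;

DUMP A_count;
DUMP B_count;
```

Two caveats worth keeping in mind: records whose `Fields` value is null satisfy neither condition, so they appear in neither count, and `GROUP ... ALL` on an empty relation produces no row at all rather than a count of 0.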

See


8 REPLIES

Master Mentor

I can't think of a way to do it in one shot in Pig. If I were to write a MapReduce job for the task, I'd implement a custom counter so that it gets updated with every filter: https://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ You can also write a UDF and update custom counters. I haven't tried it, but it's worth a shot: http://stackoverflow.com/questions/14748120/how-to-increment-hadoop-counters-in-jython-udfs-in-pig

Master Mentor

🙂 What if your filter statement is a combination of multiple ORs and ANDs?

Guru

Good question: you can use multiple conditions in parentheses, e.g.

SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);
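
The same counting pattern carries over to a compound condition. The sketch below assumes a hypothetical relation with integer fields `f1`, `f2`, and `f3` (the file name, schema, and the particular condition are made up for illustration) and uses the `OTHERWISE` branch of `SPLIT` to collect the records that do not satisfy the condition:

```pig
-- Hypothetical input file and schema, for illustration only.
Input_Data = LOAD 'input.tsv' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

-- One compound condition; OTHERWISE collects the records that do not satisfy it.
SPLIT Input_Data INTO kept IF ((f1 < 7 AND f2 == 5) OR f3 < 6), dropped OTHERWISE;

-- COUNT_STAR counts every tuple, including those whose first field is null.
kept_grp = GROUP kept ALL;
kept_count = FOREACH kept_grp GENERATE COUNT_STAR(kept) AS n;

dropped_grp = GROUP dropped ALL;
dropped_count = FOREACH dropped_grp GENERATE COUNT_STAR(dropped) AS n;

DUMP kept_count;
DUMP dropped_count;
```

`COUNT_STAR` is used here instead of `COUNT` because `COUNT` skips tuples whose first field is null. How records whose condition evaluates to null are routed by `OTHERWISE` is not covered in this sketch and is worth checking against your Pig version.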

Master Mentor

That's not the point: you execute COUNT on each filter condition. It's not efficient, but it does answer his question.

Guru

🙂 Understood. It's one of those ease-of-development (a few quick Pig lines) vs. highly optimized (custom M-R program) questions. It should still be relatively performant in Pig. I think the code above is the only way to do it in Pig.

Master Mentor

Yup, it's a choice of coding a few lines in Pig vs spending a couple of hours with Java.

Master Mentor

I might try writing a UDF with custom counters. Sounds like an interesting challenge.
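
On the efficiency point raised earlier in the thread: one possible way to get both counts from a single grouping, not proposed by any of the posters and offered only as a sketch, is to turn the filter condition into 0/1 flags and sum them. The input file name, schema, and all alias names below are assumptions.

```pig
-- Hypothetical input: file name and schema are assumptions.
Input_Data = LOAD 'input.tsv' USING PigStorage('\t') AS (Fields:int);

-- Encode the filter outcome as two 0/1 flags per record.
flagged = FOREACH Input_Data GENERATE
    ((Fields > 0) ? 1 : 0) AS kept,
    ((Fields > 0) ? 0 : 1) AS dropped;

-- A single GROUP ALL then yields both totals in one output row.
all_rows = GROUP flagged ALL;
counts = FOREACH all_rows GENERATE
    SUM(flagged.kept) AS kept_count,
    SUM(flagged.dropped) AS dropped_count;

DUMP counts;
```

For a compound filter, only the condition inside the two conditional expressions changes. Records whose condition evaluates to null should end up in neither total, since the conditional yields null and `SUM` ignores nulls; whether this is actually faster than the `SPLIT` approach on a given cluster has not been measured here.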