Count values that are filtered - Apache PIG

Rising Star
Having this statement:
  Values = FILTER Input_Data BY Fields > 0;
How can I count the number of records that passed the filter and the number that did not? Many thanks!
1 ACCEPTED SOLUTION

Guru

This should work

-- split into 2 datasets
SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;

-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);

-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);
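
A possible variation, sketched here under assumptions rather than taken from the original answer: tag each record with the outcome of the condition and group on that tag, so only one GROUP is needed. The aliases Flagged, Flag_grp, Flag_counts and the field names kept and n are hypothetical; Input_Data and Fields come from the question.

```pig
-- tag each record with the outcome of the filter condition (1 = kept, 0 = dropped);
-- a null Fields value yields a null tag, which ends up in its own group
Flagged = FOREACH Input_Data GENERATE ((Fields > 0) ? 1 : 0) AS kept;

-- group on the tag and count the records in each group
Flag_grp = GROUP Flagged BY kept;
-- COUNT_STAR is used because COUNT skips tuples whose first field is null
Flag_counts = FOREACH Flag_grp GENERATE group AS kept, COUNT_STAR(Flagged) AS n;

DUMP Flag_counts;
```

The result should have one row per tag value, so the kept and dropped counts can be read from a single relation.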



8 REPLIES

Master Mentor

I can't think of a way to do it in one shot in Pig. If I were to write a MapReduce job for the task, I'd implement a custom counter that gets updated each time the filter is applied: https://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ You can also write a UDF and update custom counters. I haven't tried it, but it's worth a shot: http://stackoverflow.com/questions/14748120/how-to-increment-hadoop-counters-in-jython-udfs-in-pig

Master Mentor

🙂 What if your filter statement combines multiple ORs and ANDs?

Guru

Good question: you can combine multiple conditions in parentheses, e.g.

SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);
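
To count both the matches and the non-matches of a compound condition, one option is to pair the condition with its negation. This is only a sketch: the particular condition is made up for illustration, reusing the relation A and fields f1 and f2 from the example above, and the *_grp and *_count aliases are hypothetical.

```pig
-- X gets records matching the compound condition, Y gets records matching its negation;
-- records for which the condition evaluates to null land in neither relation
SPLIT A INTO X IF (f1 < 7 AND f2 == 5), Y IF NOT (f1 < 7 AND f2 == 5);

-- count the matching records
X_grp = GROUP X ALL;
X_count = FOREACH X_grp GENERATE COUNT_STAR(X);

-- count the non-matching records
Y_grp = GROUP Y ALL;
Y_count = FOREACH Y_grp GENERATE COUNT_STAR(Y);
```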

Master Mentor

That's not my point: you run a separate COUNT for each filter condition. It's not efficient, but it does answer the question.

Guru

🙂 Understood. It's one of those ease-of-development (a few quick Pig lines) vs. highly optimized (custom MapReduce program) questions. It should still be relatively performant in Pig. I think the code above is the only way to do it in Pig.

Master Mentor

Yup, it's a choice of coding a few lines in Pig vs spending a couple of hours with Java.

Master Mentor

I might try writing a UDF with custom counters. It sounds like an interesting challenge.