Count values that are filtered - Apache PIG

Rising Star
Given this statement:

Values = FILTER Input_Data BY Fields > 0;

How can I count the number of records that passed the filter and the number that did not? Many thanks!
8 REPLIES

Master Mentor

I can't think of a way to do it in one shot in Pig. If I were to write a MapReduce job for the task, I'd implement a custom counter that gets updated every time a record is filtered: https://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/

You can also write a UDF and update custom counters. I haven't tried it, but it's worth a shot: http://stackoverflow.com/questions/14748120/how-to-increment-hadoop-counters-in-jython-udfs-in-pig

Master Mentor

🙂 What if your filter statement combines multiple OR and AND conditions?

Guru

Good question: you can combine multiple conditions in parentheses, e.g.:

SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);
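
For anyone who wants to try the SPLIT line end to end, here is a minimal sketch. The input path, delimiter, and schema are assumptions made up for illustration; only the field names f1, f2, f3, and f5 come from the statement above, and the AND condition on Y is an added variation.

```pig
-- Hypothetical input: the path, delimiter, and schema are assumptions.
A = LOAD 'input_data.tsv' USING PigStorage('\t')
    AS (f1:int, f2:int, f3:int, f5:int);

-- Compound conditions go inside parentheses and can mix AND and OR.
-- A record can land in more than one output relation, or in none,
-- depending on which conditions it satisfies.
SPLIT A INTO
    X IF f1 < 7,
    Y IF (f2 == 5 AND f3 >= 6),
    Z IF (f3 < 6 OR f5 == 0);

DUMP Z;
```

Because the branches are evaluated independently, the sizes of X, Y, and Z do not have to add up to the size of A unless the conditions are written to be mutually exclusive and exhaustive.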

Master Mentor

That's not the point. You execute COUNT on each filter condition; it's not efficient, but it does answer his question.
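
A rough sketch of what counting each condition could look like for the original question, assuming a hypothetical input file with a single integer column named Fields. The relation names Kept, Dropped, and the *_Grp / *_Cnt aliases are made up for illustration, and this is not necessarily the accepted solution.

```pig
-- Hypothetical input: the path and schema are assumptions.
Input_Data = LOAD 'input_data.tsv' AS (Fields:int);

-- One branch per outcome. The explicit null check keeps the two branches
-- exhaustive, because a comparison against null is neither true nor false.
SPLIT Input_Data INTO
    Kept    IF Fields > 0,
    Dropped IF (Fields <= 0 OR Fields IS NULL);

-- GROUP ... ALL puts every tuple of a relation into a single bag;
-- COUNT_STAR counts the tuples in that bag, including ones with null fields.
Kept_Grp    = GROUP Kept ALL;
Kept_Cnt    = FOREACH Kept_Grp GENERATE COUNT_STAR(Kept) AS n;
Dropped_Grp = GROUP Dropped ALL;
Dropped_Cnt = FOREACH Dropped_Grp GENERATE COUNT_STAR(Dropped) AS n;

DUMP Kept_Cnt;
DUMP Dropped_Cnt;
```

One thing to watch: if a branch is empty, GROUP ... ALL produces no rows, so the corresponding count relation comes back empty rather than showing 0.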

Guru

🙂 Understood. It's one of those ease-of-development (a few quick Pig lines) vs. highly optimized (custom MapReduce program) questions. It should still be relatively performant in Pig. I think the code above is the only way to do it in Pig.

Master Mentor

Yup, it's a choice of coding a few lines in Pig vs spending a couple of hours with Java.

Master Mentor

I might try writing a UDF with custom counters; it sounds like an interesting challenge.