Support Questions

Stewart12586 · ‎09-09-2016

I have this statement:

Values = FILTER Input_Data BY Fields > 0;

How can I count the number of records that passed the filter and the number that did not? Many thanks!

gkeys · ‎09-09-2016

This should work

-- split into 2 datasets
SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;

-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);

-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);

See

https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
http://pig.apache.org/docs/r0.9.2/func.html#count (note the use of ALL here instead of a particular field)
http://www.tutorialspoint.com/apache_pig/apache_pig_count.htm
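
One caveat worth checking, shown here as a rough sketch rather than tested code: according to the Pig documentation, COUNT skips tuples whose first field is null, while COUNT_STAR counts every tuple. The file path 'input.txt', the delimiter, and the schema (id, Fields) are assumptions made up for illustration; only Input_Data and Fields come from the question.

```pig
-- Hypothetical input: path, delimiter and schema are assumptions for illustration only.
Input_Data = LOAD 'input.txt' USING PigStorage(',') AS (id:chararray, Fields:int);

SPLIT Input_Data INTO A IF Fields > 0, B IF Fields <= 0;

-- COUNT_STAR counts every tuple, even when the first field (id) is null.
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT_STAR(A) AS kept;

B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT_STAR(B) AS dropped;

DUMP A_count;
DUMP B_count;
```

Two things to keep in mind: records where Fields is null match neither condition, so they end up in neither A nor B, and if A or B is empty, GROUP ALL produces no row at all, so the corresponding count comes back empty rather than 0.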

aervits · ‎09-09-2016

I can't think of a way to do it in one shot in Pig. If I were to write a MapReduce job for the task, I'd implement a custom counter that gets updated every time a record is filtered: https://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/

You can also write a UDF and update custom counters from it. I haven't tried it, but it's worth a shot: http://stackoverflow.com/questions/14748120/how-to-increment-hadoop-counters-in-jython-udfs-in-pig

aervits · ‎09-09-2016

🙂 What if your filter statement combines multiple OR and AND conditions?

gkeys · ‎09-09-2016

Good question: you can combine multiple conditions in parentheses, e.g.

SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);
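
To get a "kept" count and a "dropped" count for one compound filter, a possible sketch is to split on the condition and on its negation. The input path, the schema, the particular condition, and the relation names kept and dropped are all hypothetical.

```pig
-- Hypothetical input: path, delimiter and schema are assumptions for illustration only.
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int, f3:int, f5:int);

-- One compound condition and its negation define the two sets to count.
SPLIT A INTO kept IF (f1 < 7 AND (f3 < 6 OR f5 == 0)),
             dropped IF NOT (f1 < 7 AND (f3 < 6 OR f5 == 0));

kept_grp = GROUP kept ALL;
kept_count = FOREACH kept_grp GENERATE COUNT_STAR(kept);

dropped_grp = GROUP dropped ALL;
dropped_count = FOREACH dropped_grp GENERATE COUNT_STAR(dropped);

DUMP kept_count;
DUMP dropped_count;
```

If the condition evaluates to null for a record (for example because one of the fields is null), that record lands in neither relation, so the two counts may not add up to the input size.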

aervits · ‎09-09-2016

That's not my point. You execute a separate COUNT for each filter condition, which is not efficient, but it does answer his question.

gkeys · ‎09-09-2016

🙂 Understood. It's one of those ease-of-development (a few quick Pig lines) vs. highly optimized (custom MapReduce program) questions. It should still be relatively performant in Pig. I think the code above is the only way to do it in Pig.
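
That said, one variation that might avoid the separate GROUP per branch is to tag each record with 1/0 flags and sum both flags in a single GROUP ALL. This is an untested, unbenchmarked sketch; the input path, the schema, and the names flagged, kept and dropped are assumptions for illustration.

```pig
-- Hypothetical input: path, delimiter and schema are assumptions for illustration only.
Input_Data = LOAD 'input.txt' USING PigStorage(',') AS (id:chararray, Fields:int);

-- Tag each record: kept = 1 when it passes the filter, dropped = 1 when it does not.
flagged = FOREACH Input_Data GENERATE
    ((Fields > 0) ? 1L : 0L) AS kept,
    ((Fields > 0) ? 0L : 1L) AS dropped;

-- A single GROUP ALL lets both sums be computed together.
all_rows = GROUP flagged ALL;
counts = FOREACH all_rows GENERATE
    SUM(flagged.kept) AS kept_count,
    SUM(flagged.dropped) AS dropped_count;

DUMP counts;
```

When Fields is null, both flags come out null and SUM ignores them, so such records are counted on neither side, which matches how the SPLIT version treats them.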

aervits · ‎09-09-2016

Yup, it's a choice of coding a few lines in Pig vs spending a couple of hours with Java.

aervits · ‎09-09-2016

I might try writing a UDF with custom counters. It sounds like an interesting challenge.

Cloudera Community

Count values that are filtered - Apache PIG