Created 09-09-2016 06:18 PM
Created 09-09-2016 08:45 PM
This should work
-- split into 2 datasets
SPLIT Input_data INTO A IF Field > 0, B IF Field <= 0;
-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);
-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);
See
Created 09-09-2016 08:29 PM
I can't think of a way to do it in one shot in Pig. If I were to write a MapReduce job for the task, I'd implement a custom counter that gets updated every time a filter condition matches: https://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/
You can also write a UDF and update custom counters from there. I haven't tried it, but it's worth a shot: http://stackoverflow.com/questions/14748120/how-to-increment-hadoop-counters-in-jython-udfs-in-pig
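To make the counter idea concrete, here is a minimal sketch in plain Python of what "one pass, one counter per filter" means. It is only an illustration of the logic, not Hadoop or Pig API code: the function name `count_by_condition`, the counter names, and the sample records are all made up for this example. In a real job each increment would go to a Hadoop counter instead of an in-memory `Counter`.

```python
from collections import Counter

def count_by_condition(records, conditions):
    """Scan the records once and bump one named counter per matching condition."""
    # Start every counter at 0 so conditions that never match still show up.
    counters = Counter({name: 0 for name in conditions})
    for record in records:
        for name, predicate in conditions.items():
            if predicate(record):
                counters[name] += 1
    return counters

# Hypothetical input: a single numeric column called Field, as in the thread.
records = [{"Field": 3}, {"Field": -1}, {"Field": 0}, {"Field": 7}]

# Hypothetical counter names, one per filter condition.
conditions = {
    "positive": lambda r: r["Field"] > 0,
    "non_positive": lambda r: r["Field"] <= 0,
}

counts = count_by_condition(records, conditions)
print(counts["positive"], counts["non_positive"])  # prints: 2 2
```

The point of the design is that the data is read once no matter how many conditions you add, which is the efficiency argument for counters over running a separate count per filter.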
Created 09-09-2016 08:50 PM
🙂 What if your filter statement combines multiple OR and AND conditions?
Created 09-09-2016 08:54 PM
Good question: you can combine multiple conditions in parentheses, e.g.:
SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);
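One thing to keep in mind with conditions like these: as far as I know, each SPLIT condition is checked on its own, so a record can land in more than one output relation, or in none. The plain Python sketch below mimics that behavior for the X/Y/Z example. It is an illustration only, not Pig code; the helper name `split_by_conditions` and the sample records are invented for the example.

```python
def split_by_conditions(records, conditions):
    """Route each record to every output whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for record in records:
        for name, predicate in conditions.items():
            if predicate(record):
                outputs[name].append(record)
    return outputs

# Same conditions as the SPLIT statement, written as Python predicates.
conditions = {
    "X": lambda r: r["f1"] < 7,
    "Y": lambda r: r["f2"] == 5,
    "Z": lambda r: r["f3"] < 6 or r["f5"] == 0,
}

# Hypothetical sample records.
records = [
    {"f1": 1, "f2": 5, "f3": 9, "f5": 0},  # matches X, Y and Z
    {"f1": 8, "f2": 2, "f3": 3, "f5": 4},  # matches Z only
    {"f1": 9, "f2": 1, "f3": 7, "f5": 2},  # matches nothing
]

result = split_by_conditions(records, conditions)
sizes = {name: len(rows) for name, rows in result.items()}
print(sizes)  # prints: {'X': 1, 'Y': 1, 'Z': 2}
```

So the per-relation counts do not have to add up to the number of input records unless the conditions are mutually exclusive and cover every case.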
Created 09-09-2016 09:07 PM
That's not the point: you execute COUNT once for each filter condition. It's not efficient, but it does answer his question.
Created 09-09-2016 09:24 PM
🙂 Understood. It's one of those ease-of-development (a few quick Pig lines) vs. highly optimized (custom MapReduce program) questions. It should still be relatively performant in Pig. I think the code above is the only way to do it in Pig.
Created 09-09-2016 09:45 PM
Yup, it's a choice of coding a few lines in Pig vs spending a couple of hours with Java.
Created 09-09-2016 09:48 PM
I might try writing a UDF with custom counters. Sounds like an interesting challenge.