UPDATE: I'm happy to report that my patch for PIG-4931 was accepted and merged to trunk.
I was browsing through the Apache Pig JIRAs and stumbled on https://issues.apache.org/jira/browse/PIG-4931, which asks for documentation of Pig's IN operator. It turns out Pig has had the IN operator since 0.12 and no one has had a chance to document it yet. The associated JIRA is https://issues.apache.org/jira/browse/PIG-3269. In this short article I will go over the IN operator, and until I'm able to submit a patch to close out the ticket, this should serve as its documentation.
Now, the IN operator in Pig works like it does in SQL. You provide a list of values, and the filter returns just the rows whose field matches one of them.
It is a lot more convenient than chaining equality checks with OR, for example:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY
(i == 1) OR
(i == 22) OR
(i == 333) OR
(i == 4444) OR
(i == 55555);
You can rewrite the same statement as:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1,22,333,4444,55555);
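Since IN is just a boolean condition, I would expect it to work anywhere Pig accepts one, not only in FILTER. Here is a minimal sketch with SPLIT, assuming the same 1.txt as above. The aliases picked and rest are my own names, and I have not run this against every Pig release.

```pig
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
-- route matching rows to "picked" and the non-matching rows to "rest"
SPLIT a INTO picked IF i IN (1, 22, 333), rest IF NOT i IN (1, 22, 333);
DUMP picked;
DUMP rest;
```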
The best thing about IN is that it accepts more than just integers: you can also pass float, double, BigDecimal, BigInteger, bytearray and String values.
Let's look at a few of these in detail.
Passing an Integer
A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN (4, 6);
dump X;
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
Passing a String
A = load 'data' using PigStorage(',') AS (id:chararray, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN ('2', '4', '8');
dump X;
(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(8,Todd,Scott,Male)
Passing a ByteArray
A = load 'data' using PigStorage(',') AS (id:bytearray, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN ('1', '9');
dump X;
(1,Christine,Romero,Female)
(9,Sharon,Mccoy,Female)
Passing a BigInteger and using the NOT operator to negate the IN clause, which keeps only the rows whose id is not in the list
A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY NOT id IN (1, 3, 5, 7, 9);
dump X;
(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
(8,Todd,Scott,Male)
(10,Evelyn,Rice,Female)
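One thing to keep in mind with NOT, assuming IN follows Pig's usual null handling for comparisons (I have not checked this against the implementation): a row whose id is null should be dropped by both id IN (...) and NOT id IN (...), because the comparison evaluates to null rather than true or false. A sketch of how to keep those rows, using the same 'data' file:

```pig
A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray);
-- keep rows with a missing id as well as rows whose id is not in the list
X = FILTER A BY (id IS NULL) OR (NOT id IN (1, 3, 5, 7, 9));
dump X;
```

If every row in your data has an id, this returns the same rows as the plain NOT version above.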
Now, I understand that most cool kids these days are using Spark, but I strongly believe Pig has a place in any Big Data stack, and its livelihood depends on comprehensive and complete documentation. Happy learning!