UPDATE: I'm happy to report that my patch for PIG-4931 was accepted and merged to trunk.

I was browsing through the Apache Pig JIRAs and stumbled on https://issues.apache.org/jira/browse/PIG-4931, which asks for documentation of the Pig IN operator. It turns out Pig has had an IN operator since version 0.12, and no one had documented it yet. The associated JIRA is https://issues.apache.org/jira/browse/PIG-3269. In this short article I will go over the IN operator; until I'm able to submit a patch to close out the ticket, this should serve as its documentation. The IN operator in Pig works like it does in SQL: you provide a list of values, and the filter returns only the rows whose field matches one of them.

It is a lot more concise than chaining equality checks with OR, for example:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int); 
b = FILTER a BY 
   (i == 1) OR
   (i == 22) OR
   (i == 333) OR
   (i == 4444) OR
   (i == 55555); 

You can rewrite the same statement as

a = LOAD '1.txt' USING PigStorage(',') AS (i:int); 
b = FILTER a BY i IN (1,22,333,4444,55555); 

The best thing about it is that it accepts more than just integers: you can pass float, double, BigDecimal, BigInteger, bytearray and String values. Let's review a few of them using the following data set.

grunt> fs -cat data;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female 

Passing an integer to IN clause

A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN (4, 6); 
dump X; 

(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female) 
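
The float and double cases follow the same pattern. Here is a minimal sketch, assuming the same data file with id declared as double and the list written with double literals. It should keep the same two rows, Kimberly and Ann.

```pig
-- Assumption: same 'data' file as above, with id read as a double instead of an int
A = LOAD 'data' USING PigStorage(',') AS (id:double, first:chararray, last:chararray, gender:chararray);
-- keep rows whose id equals one of the listed double values
X = FILTER A BY id IN (4.0, 6.0);
DUMP X;
```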

Passing a String

A = load 'data' using PigStorage(',') AS (id:chararray, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN ('2', '4', '8'); 
dump X;

(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(8,Todd,Scott,Male) 
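
The String form is not limited to numeric-looking values. As a small sketch against the same data file, the filter below matches on the first column instead. chararray comparison is case-sensitive, so 'sara' would not match. With the data shown above, it should return the rows for Sara Hansen and Todd Scott.

```pig
A = LOAD 'data' USING PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
-- keep rows whose first name is exactly one of the listed strings
X = FILTER A BY first IN ('Sara', 'Todd');
DUMP X;
```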

Passing a ByteArray

A = load 'data' using PigStorage(',') AS (id:bytearray, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN ('1', '9'); 
dump X; 

(1,Christine,Romero,Female)
(9,Sharon,Mccoy,Female) 

Passing a BigInteger and using the NOT operator, which negates the IN clause so that only rows whose id is not in the list are returned

A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY NOT id IN (1, 3, 5, 7, 9); 
dump X;

(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
(8,Todd,Scott,Male)
(10,Evelyn,Rice,Female)
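
An IN clause is an ordinary condition, so it can be combined with other conditions using AND or OR. The sketch below reuses the same data file; the parentheses are only there for readability. With the data shown above, the only even-numbered Male row is id 8, so it should return (8,Todd,Scott,Male).

```pig
A = LOAD 'data' USING PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
-- keep rows whose id is in the list AND whose gender column is 'Male'
X = FILTER A BY (id IN (2, 4, 6, 8, 10)) AND (gender == 'Male');
DUMP X;
```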

Now, I understand that most cool kids these days are using Spark, but I strongly believe Pig has a place in any Big Data stack, and its livelihood depends on comprehensive and complete documentation. Happy learning!
