
UPDATE: I'm happy to report that my patch for PIG-4931 was accepted and merged to trunk.

I was browsing through the Apache Pig JIRAs and stumbled on https://issues.apache.org/jira/browse/PIG-4931, which asks for documentation of the Pig IN operator. It turns out Pig has had the IN operator since version 0.12, but no one had gotten around to documenting it. The associated JIRA is https://issues.apache.org/jira/browse/PIG-3269. In this short article I will go over the IN operator; until I'm able to submit a patch to close out the ticket, this should serve as its documentation. The IN operator in Pig works like the one in SQL: you provide a list of values, and the filter keeps only the rows whose field matches one of them.

It is a lot more concise than chaining equality checks with OR, for example:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int); 
b = FILTER a BY 
   (i == 1) OR
   (i == 22) OR
   (i == 333) OR
   (i == 4444) OR
   (i == 55555); 

You can rewrite the same statement as

a = LOAD '1.txt' USING PigStorage(',') AS (i:int); 
b = FILTER a BY i IN (1,22,333,4444,55555); 

The best thing about it is that it accepts more than just integers: you can also pass float, double, BigDecimal, BigInteger, bytearray and String values. Let's look at a few of these using the following sample data.

grunt> fs -cat data;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female 

Passing an integer to the IN clause

A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN (4, 6); 
dump X; 

(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female) 

Passing a String

A = load 'data' using PigStorage(',') AS (id:chararray, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN ('2', '4', '8'); 
dump X;

(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(8,Todd,Scott,Male) 
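
The same pattern should apply to any chararray column, not only to an id loaded as a string. The snippet below is a small sketch against the sample data above, filtering on the last column; the choice of names is only for illustration, and the comparison is assumed to be an exact, case-sensitive match.

```pig
-- Load the sample file with id as an int and the name columns as chararray
A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
-- Keep only the rows whose last name is one of the listed values
X = FILTER A BY last IN ('Reed', 'Scott');
dump X;

-- Expected rows for the sample data:
-- (7,Kathleen,Reed,Female)
-- (8,Todd,Scott,Male)
```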

Passing a ByteArray

A = load 'data' using PigStorage(',') AS (id:bytearray, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY id IN ('1', '9'); 
dump X; 

(1,Christine,Romero,Female)
(9,Sharon,Mccoy,Female) 

Passing a BigInteger and using the NOT operator, which negates the IN clause so that rows matching any value in the list are excluded

A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray); 
X = FILTER A BY NOT id IN (1, 3, 5, 7, 9); 
dump X;

(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
(8,Todd,Scott,Male)
(10,Evelyn,Rice,Female)
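
IN can also be combined with other conditions in the same FILTER. The following is a minimal sketch on the same sample data; the explicit parentheses around the IN test are an optional style choice here, added so the scope of NOT is clear to the reader.

```pig
-- Load the sample file with id as an int
A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
-- Keep male rows whose id is not in the exclusion list
X = FILTER A BY (gender == 'Male') AND NOT (id IN (3, 5));
dump X;

-- Expected row for the sample data:
-- (8,Todd,Scott,Male)
```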

Now, I understand that most of the cool kids these days are using Spark, but I strongly believe Pig has a place in any Big Data stack, and its livelihood depends on comprehensive and complete documentation. Happy learning!

Version history: revision 1 of 1, last updated 08-08-2016 04:54 PM