Pig: Streaming through python

Explorer

Hi,

I have a small voters list (name, gender, place, age) from which I want to eliminate the voters whose age is <= 20. I want to try streaming in Pig.

When I run DUMP on the streamed relation, it fails and is unable to identify the Python commands. I have attached the Python script, input data file, Pig script, and log file. Could you advise where I should install Python in the Sandbox? Thank you.

Input:

AAA,Female,Blr,40 
BBB,Female,London,35
YYY,Female,Pondy,12
JJJ,Male,London,4
SSS,Female,Pondy,30

Pig script in tez_local mode:

grunt> Voters = LOAD 'file:///user/revathy/pig/Voters.txt' USING PigStorage(',') AS (VoterName:chararray,Gender:chararray,Place:chararray,Age:int); 
 
grunt> Eligible = STREAM Voters THROUGH `/root/revathy/pig/hello.py` AS (VoterName:chararray,Gender:chararray,Place:chararray,Age:int);

Python script (tested in a Python editor):

import sys
THRESHOLD = 20

def filterVal(line, val4):
    if int(val4) > THRESHOLD:
        sys.stdout.writelines(line)
    return

try:
    for line in sys.stdin.readlines():
        val1, val2, val3, val4 = str(line).split(",")
        filterVal(line, val4)
except:
    print("Error in try block")

Log:

/root/revathy/pig/hello.py: line 1: import: command not found
/root/revathy/pig/hello.py: line 2: THRESHOLD: command not found
/root/revathy/pig/hello.py: line 3: : command not found

Accepted Solutions

Mentor

Your Python script does not include the interpreter line, so it is being run as a shell script and the shell does not recognize the Python statements. For what you're trying to achieve, you can skip streaming and just use Pig's built-in FILTER operator. It will perform better than streaming. See http://pig.apache.org/docs/r0.15.0/ for details, for example:

SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long, name:chararray);

/* do a left outer join of SSN with SSN_Name */
X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn;

/* only keep those ssn's for which there is no name */
Y = filter X by IsEmpty(SSN_NAME);

Explorer

Thank you for providing an alternative approach. I am learning Pig and would like to try the STREAM command to see how to run Python in Pig.

Is this the line to add as the first line so that the execution engine understands it's Python?

#! /usr/bin/env python

I tried it but still get the same error. Could you please help? Thank you!

Mentor

Check out my UDF examples using streaming: https://github.com/dbist/pig/tree/master/udfs

Specifically, see the formathtml.pig script and its associated UDF written in Python.
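
For the voters case in this thread, a minimal sketch of a streaming filter with the interpreter line in place might look like the following. It is not taken from the linked repository. The names is_eligible, main, and DELIMITER are made up for illustration, and the tab delimiter is an assumption based on Pig's default streaming serializer (PigStreaming), which passes fields to the script tab-separated rather than comma-separated. The script also needs to be executable (chmod +x) and is probably best saved with Unix line endings, since the empty command name on line 3 of the log above would be consistent with a stray carriage return.

```python
#!/usr/bin/env python
"""Streaming filter sketch: keep voters older than THRESHOLD."""
import sys

THRESHOLD = 20    # voters aged THRESHOLD or younger are dropped
DELIMITER = "\t"  # assumption: Pig streams fields tab-separated by default


def is_eligible(line, delimiter=DELIMITER):
    """Return True if the last field of the record is an age above THRESHOLD."""
    fields = line.rstrip("\r\n").split(delimiter)
    try:
        return int(fields[-1]) > THRESHOLD
    except ValueError:
        # Malformed or empty age: skip the record instead of crashing.
        return False


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy eligible records from stdin to stdout, one per line."""
    for line in stdin:
        if is_eligible(line):
            stdout.write(line.rstrip("\r\n") + "\n")


if __name__ == "__main__":
    main()
```

Before wiring it into STREAM ... THROUGH, it can be tried outside Pig by piping a few tab-separated records into the script from the shell. That separates interpreter and permission problems from Pig problems.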

New Contributor

Do you know if there is a way to specify a python virtual environment for streaming_python to use instead of it using the base python installation?

Explorer

Thank you. It's a good, simple example for me to understand.