Created 04-20-2016 02:26 AM
Hi,
I have a small Voters list (name, gender, place, age) from which I want to eliminate the voters whose age is <= 20. I wanted to try streaming in Pig.
When I run DUMP on the stream, it fails and is unable to identify the Python commands. I have attached the Python script, input data file, Pig script and log file. Could you tell me where I should install Python in the Sandbox? Thank you.
Input:
AAA,Female,Blr,40
BBB,Female,London,35
YYY,Female,Pondy,12
JJJ,Male,London,4
SSS,Female,Pondy,30
Pig script in tez_local mode:
grunt> Voters = LOAD 'file:///user/revathy/pig/Voters.txt' USING PigStorage(',') AS (VoterName:chararray,Gender:chararray,Place:chararray,Age:int);
grunt> Eligible = STREAM Voters THROUGH `/root/revathy/pig/hello.py` AS (VoterName:chararray,Gender:chararray,Place:chararray,Age:int);
Python script (tested in a Python editor):
import sys
THRESHOLD = 20

def filterVal(line,val4):
    if int(val4) > THRESHOLD:
        sys.stdout.writelines(line)
    return

try:
    for line in sys.stdin.readlines():
        val1,val2,val3,val4 = str(line).split(",")
        filterVal(line,val4)
except:
    print "Error in try block"
Log:
/root/revathy/pig/hello.py: line 1: import: command not found
/root/revathy/pig/hello.py: line 2: THRESHOLD: command not found
/root/revathy/pig/hello.py: line 3: : command not found
Created 04-20-2016 10:18 AM
You did not include the Python interpreter line (the shebang) in your script, so it is being run as a shell script rather than as Python. That is why the log shows errors such as "import: command not found". For what you're trying to achieve, you can skip streaming and just use Pig's built-in FILTER operator. It will perform better than streaming. http://pig.apache.org/docs/r0.15.0/
SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long, name:chararray);
/* do a left outer join of SSN with SSN_Name */
X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn;
/* only keep those ssn's for which there is no name */
Y = filter X by IsEmpty(SSN_NAME);
Created 04-20-2016 11:17 PM
Thank you for providing an alternative approach. I am learning Pig and would like to try the STREAM command to see how to run Python in Pig.
Is this the line to add as the first line so that the execution engine understands it's Python?
#! /usr/bin/env python
I tried it but still get the same error. Could you please help? Thank you!
Created 04-21-2016 12:04 AM
Check out my UDF examples using streaming: https://github.com/dbist/pig/tree/master/udfs
Specifically, see the formathtml.pig script and its associated UDF written in Python.
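Along the same lines, here is a minimal sketch of what a streaming filter for the voter data in this thread could look like. It is not taken from the repository above. It assumes that Pig hands each tuple to the script as one tab-delimited line on stdin (the default streaming serialization, as far as I know) and that the age is the last field. The helper name is_eligible is made up for illustration. The interpreter line has to be the very first line of the file, and the file has to be executable (chmod +x).

```python
#!/usr/bin/env python
"""Streaming filter sketch: pass through voters older than THRESHOLD."""
import sys

THRESHOLD = 20  # voters with age <= 20 are dropped, as in the original script


def is_eligible(line, delimiter="\t"):
    """Return True if the last field of `line` is an integer above THRESHOLD.

    Assumption: fields arrive tab-delimited, with the age in the last column.
    """
    fields = line.rstrip("\r\n").split(delimiter)
    try:
        return int(fields[-1]) > THRESHOLD
    except ValueError:
        # Skip malformed or empty lines instead of aborting the whole stream.
        return False


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Iterate over stdin lazily rather than calling readlines(), so large
    # inputs are not loaded into memory at once.
    for line in stdin:
        if is_eligible(line):
            stdout.write(line)


if __name__ == "__main__":
    main()
```

If the interpreter line still is not picked up, another option is to name the interpreter explicitly in the STREAM command, for example THROUGH `python /root/revathy/pig/hello.py`, which should not depend on the shebang at all.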
Created 06-10-2019 05:59 PM
Do you know if there is a way to specify a Python virtual environment for streaming_python to use, instead of the base Python installation?
Created 04-21-2016 04:31 PM
Thank you. It's a good, simple example for me to understand.