Created 10-29-2024 10:10 AM
pig-0.17.0bin/pig -x local
very basic UDF file:
#!/usr/bin/python3
from pig_util import outputSchema

@outputSchema("as:int")
def square(num):
    if num is None:
        return None
    return num * num

@outputSchema("word:chararray")
def concat(word):
    return word + word
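To rule out the Python logic itself, the two functions can be exercised outside Pig with plain python3. This is only a minimal sketch: `pig_util` is normally importable only when Pig runs the script, so the fallback `outputSchema` decorator below is a stand-in assumed for local testing, not Pig's implementation.

```python
# Local sanity check for the UDF logic, run with plain python3 (no Pig involved).
try:
    from pig_util import outputSchema  # available when Pig runs the script
except ImportError:
    # Stand-in for local testing only (an assumption, not Pig's implementation):
    # it records the schema string and returns the function unchanged.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("as:int")
def square(num):
    # Pig passes nulls through as None, so guard before multiplying.
    if num is None:
        return None
    return num * num


@outputSchema("word:chararray")
def concat(word):
    return word + word


if __name__ == "__main__":
    print(square(7))
    print(square(None))
    print(concat("ab"))
```

If these calls behave as expected, the failure is more likely in how Pig launches or locates the script than in the functions themselves.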
Exceedingly simple pig script:
REGISTER '/home/scs/woodcock/SD411/lab_udf/test.py' USING org.apache.pig.scripting.streaming.Python.PythonScriptEngine AS myFuncs;
A = LOAD '/home/scs/woodcock/SD411/DATA/accident.csv' USING PigStorage(',') AS (state:int,name:chararray);
B = FOREACH A GENERATE myFuncs.square(state) AS state, name;
If I do a "DUMP A" I get exactly what I would expect.
But, on a "DUMP B", I get a failed job:
java.lang.Exception: org.apache.pig.impl.streaming.StreamingUDFException: LINE :
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.pig.impl.streaming.StreamingUDFException: LINE :
at org.apache.pig.impl.builtin.StreamingUDF$ProcessErrorThread.run(StreamingUDF.java:506)
grunt> Exception in thread "Thread-82" java.lang.NullPointerException: Cannot invoke "java.util.concurrent.BlockingQueue.put(Object)" because the return value of "org.apache.pig.impl.builtin.StreamingUDF.access$500(org.apache.pig.impl.builtin.StreamingUDF)" is null
at org.apache.pig.impl.builtin.StreamingUDF$ProcessOutputThread.run(StreamingUDF.java:471)
2024-10-29 13:02:15,296 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - map > map
?
Created 11-05-2024 12:09 PM
Here is the actual fix (it's actually quite loony): don't wrap the name of the file in the REGISTER statement with single quotes. That's it.
Catastrophic problems here:
1) Obviously not backwards compatible.
2) If this is a problem, why not just report that a) the format is wrong, b) a path that started with a single quote did not yield a valid Python file, or c) anything understandable, instead of getting into the middle of the M/R computation and throwing wacky (mkey? NullPointerException) errors?
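For anyone with a pile of existing scripts, the workaround above could be applied mechanically. The following is a rough sketch, not an official tool: `unquote_register_paths` is a hypothetical helper, it assumes paths contain no single quotes or spaces, and it relies only on the behavior reported in this thread (unquoted path works in Pig 0.17), not on anything documented. The sample statement is the one from the original post.

```python
import re

# Matches "REGISTER '<path>' USING" at the start of a line (keywords are
# case-insensitive). REGISTER statements without USING (e.g. plain jars)
# are deliberately left alone.
_QUOTED_REGISTER = re.compile(
    r"^(\s*REGISTER\s+)'([^']+)'(\s+USING\b)",
    re.IGNORECASE | re.MULTILINE,
)


def unquote_register_paths(script_text):
    """Return script_text with the single quotes dropped from REGISTER ... USING paths."""
    return _QUOTED_REGISTER.sub(r"\1\2\3", script_text)


if __name__ == "__main__":
    # Statement copied from the original post.
    sample = (
        "REGISTER '/home/scs/woodcock/SD411/lab_udf/test.py' "
        "USING org.apache.pig.scripting.streaming.Python.PythonScriptEngine AS myFuncs;\n"
    )
    print(unquote_register_paths(sample))
```

The rewritten statement keeps everything else intact, so the same script can be diffed against the quoted original when testing on an older Pig release.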
Created 10-29-2024 11:40 AM
@mew Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our MapReduce experts @Stella Tang @vchalla @jeniferA who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres
Created 10-31-2024 08:46 AM
Honestly, at this point, I would probably accept any (less trivial than a "HelloWorld"--that is something that actually computes, not just returns a fixed string) Python UDF and the script that will work in pig17. I feel like I'm just cutting and pasting the standard documented examples, and that's not close to working, which isn't giving me a great feeling.
Created 11-01-2024 10:05 AM
I also note that I can get Java UDFs to work, so it's not a general UDF problem...it's something specific to Python.
Created 11-04-2024 06:06 AM
Hmmm...so I tried rolling back to Pig 13, and somewhat troubling...but that totally worked. On multiple different machines. Perhaps something didn't get tested real well before release?