Support Questions

alexander_witte · ‎07-11-2018

Hi!

I have a UDF that returns a tuple object:

stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

def stringClassifier(x,y,z):
	... do some code
	return (value1,value2)
stringClassifier_udf = udf(stringClassifier, stringSchema)

I use it in a dataframe like this:

df = df.select(['route', 'routestring', stringClassifier_udf(x,y,z).alias('newcol')])

This works fine. I later split that tuple into two distinct columns. The UDF however does some string matching and is somewhat slow as it collects to the driver and then filters through a 10k item list to match a string. (it does this for every row). I've been reading about pandas_udf and Apache Arrow and was curious if running this same function would be possible with pandas_udf... or if this would be help improve the performance..? I think my hangup is that the return value of the UDF is a tuple item... here is my attempt:

from pyspark.sql.functions import pandas_udf, PandasUDFType

stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

@pandas_udf(stringSchema)
def stringClassifier(x,y,z):
	... do some code
	return (value1,value2)

Of course this is gives me errors and I've tried decorating the function with: @pandas_udf('list', PandasUDFType.SCALAR)

My errors looks like this:

NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(fixedRoute,StringType,false),StructField(accuracy,IntegerType,false))) is not supported

Any idea if there is a way to make this work?

Thanks!

o912451 · ‎07-12-2018

It looks like you are using a scalar pandas_udf type, which doesn't support returning structs currently. I believe the return type you want is an array of strings, which is supported, so this should work. Try this:

@pandas_udf("array<string>")
def stringClassifier(x,y,z):
        # return a pandas series of a list of strings, that is same length as input - for example
        s = pd.Series([[u"a", u"b"]] * len(x))
	return s

If you are using Python 2, make sure your strings are in unicode otherwise they might get interpreted as bytes. Hope that helps!

alexander_witte · ‎07-12-2018

Hey Bryan thanks so much for taking the time! I think I'm almost there! The hint about the unicode issue helping me get past the first slew of errors. I seem to be running into a length one now however:

@pandas_udf("array<string>")
def stringClassifier(lookupstring, first, last):

    lookupstring = lookupstring.to_string().encode("utf-8")
    first = first.to_string().encode("utf-8")
    last = last.to_string().encode("utf-8")
    
	#this part takes the 3 strings above and reaches out to another library to do a string match
    result = process.extract(lookupstring, lookup_list, limit=4000)
    match_list = [item for item in result if item[0].startswith(first) and item[0].endswith(last)]
    result2 = process.extractOne(lookupstring, match_list)

    if result2 is not None and result2[0][1] > 75:
        answer = pd.Series(list(result2[0]))
        return answer
    else:
        fail = ["N/A","0"]
        return pd.Series(fail)

RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 2

I'm initially passing three strings as variables to the function which then get passed to another library. The result is a tuple which I covert to a list then to a pandas Series object. I'm curious how I can make a 2 item array object a length of 1 ..? I'm obviously missing some basics here.

@Bryan C

Cloudera Community

Support Questions

Pandas_udf with a tuple? (pyspark)