question Pandas_udf with a tuple? (pyspark) in Support Questions

https://community.cloudera.com/t5/Support-Questions/Pandas-udf-with-a-tuple-pyspark/m-p/190142#M152231

Hi! I have a UDF that returns a tuple object:

<PRE>
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

def stringClassifier(x, y, z):
    ...  # do some code
    return (value1, value2)

stringClassifier_udf = udf(stringClassifier, stringSchema)
</PRE>

I use it in a dataframe like this:

<PRE>
df = df.select(['route', 'routestring', stringClassifier_udf(x, y, z).alias('newcol')])
</PRE>

This works fine. I later split that tuple into two distinct columns. The UDF, however, does some string matching and is somewhat slow: it collects to the driver and then filters through a 10k-item list to match a string, and it does this for every row.

I've been reading about pandas_udf and Apache Arrow and was curious whether running this same function would be possible with pandas_udf, or whether it would help improve the performance. I think my hangup is that the return value of the UDF is a tuple. Here is my attempt:

<PRE>
from pyspark.sql.functions import pandas_udf, PandasUDFType

stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

@pandas_udf(stringSchema)
def stringClassifier(x, y, z):
    ...  # do some code
    return (value1, value2)
</PRE>

Of course this gives me errors, and I've also tried decorating the function with: @pandas_udf('list', PandasUDFType.SCALAR)

My error looks like this:

<PRE>
NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(fixedRoute,StringType,false),StructField(accuracy,IntegerType,false))) is not supported
</PRE>

Any idea if there is a way to make this work?

Thanks!

Wed, 11 Jul 2018 21:33:50 GMT, alexander_witte
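
One possible direction, offered only as a sketch: a scalar pandas UDF receives whole batches as pandas.Series, so the function body has to be vectorized rather than return one tuple per row. As far as I know, Spark 3.0 and later accept a StructType return type for scalar pandas UDFs when the function returns a pandas.DataFrame with one column per struct field; the Spark version that raised the error above does not. The pure-pandas function below shows that shape. The names classify_batch and KNOWN_ROUTES are hypothetical, and the exact-match lookup via isin is only a stand-in for the string matching that was elided in the question.

```python
import pandas as pd

# Hypothetical stand-in for the 10k-item reference list from the question.
KNOWN_ROUTES = ["A1", "B2", "C3"]


def classify_batch(route: pd.Series) -> pd.DataFrame:
    """Classify a whole batch of route strings at once.

    Returns one DataFrame column per field of the struct schema
    (fixedRoute: string, accuracy: int).
    """
    # Exact-match lookup; an assumption, the real matching logic is not shown.
    matched = route.isin(KNOWN_ROUTES)
    return pd.DataFrame({
        # Keep the route where it matched, otherwise a placeholder value.
        "fixedRoute": route.where(matched, "unknown"),
        # 100 for a match, 0 otherwise, as a 32-bit integer column.
        "accuracy": matched.astype("int32") * 100,
    })


# Assumption: Spark 3.0+ only, not the version used in the question.
# from pyspark.sql.functions import pandas_udf
# classify_udf = pandas_udf(classify_batch, "fixedRoute string, accuracy int")
# df = df.withColumn("newcol", classify_udf("routestring"))
```

Whether this is actually faster depends on how much of the matching can be expressed as vectorized pandas operations; a per-row Python loop inside the pandas UDF would keep most of the original cost.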