<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Pandas_udf with a tuple? (pyspark) in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Pandas-udf-with-a-tuple-pyspark/m-p/190142#M152231</link>
    <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I have a UDF that returns a tuple object:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema of the tuple returned by the UDF
stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

def stringClassifier(x, y, z):
    # ... do some code
    return (value1, value2)

stringClassifier_udf = udf(stringClassifier, stringSchema)

&lt;/PRE&gt;&lt;P&gt;I use it in a dataframe like this:&lt;/P&gt;&lt;PRE&gt;df = df.select(['route', 'routestring', stringClassifier_udf(x,y,z).alias('newcol')])
&lt;/PRE&gt;&lt;P&gt;This works fine, and I later split that tuple into two distinct columns. However, the UDF does some string matching and is somewhat slow: it collects to the driver and then filters through a 10k-item list to match a string, and it does this for every row. I've been reading about pandas_udf and Apache Arrow and was curious whether this same function could run as a pandas_udf, and whether that would help improve performance. I think my hang-up is that the UDF returns a tuple. Here is my attempt:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.functions import pandas_udf, PandasUDFType

stringSchema = StructType([
    StructField("fixedRoute", StringType(), False),
    StructField("accuracy", IntegerType(), False)])

@pandas_udf(stringSchema)
def stringClassifier(x, y, z):
    # ... do some code
    return (value1, value2)

&lt;/PRE&gt;&lt;P&gt;Of course this gives me errors, and I've also tried decorating the function with: &lt;EM&gt;@pandas_udf('list', PandasUDFType.SCALAR)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;My error looks like this:&lt;/P&gt;&lt;PRE&gt;NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(fixedRoute,StringType,false),StructField(accuracy,IntegerType,false))) is not supported&lt;/PRE&gt;&lt;P&gt;Any idea if there is a way to make this work?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Wed, 11 Jul 2018 21:33:50 GMT</pubDate>
    <dc:creator>alexander_witte</dc:creator>
    <dc:date>2018-07-11T21:33:50Z</dc:date>
  </channel>
</rss>

