Hi All,
I have defined the following function as a UDF in PySpark:
import numpy as np
import pandas as pd

def dist(x, y):
    # words() and lvd() are helper functions defined elsewhere
    a = words(x, 'event')
    b = words(y, 'proj_desc')
    # dummy key so the merge produces a full cross join of a and b
    a['key'] = 1
    b['key'] = 1
    ab = pd.merge(a, b)
    del ab['key']
    # renamed the lambda argument so it doesn't shadow the outer x
    ab['lv'] = ab.apply(lambda row: lvd(row['event'], row['proj_desc']), axis=1)
    score = float(np.mean(ab.groupby('event')['lv'].max()))
    match = float(len(ab[ab['lv'] == 1]))
    sim = match / len(a)
    return (x, score, match, sim)
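As an aside, the dummy-key merge above is a common idiom for a cross join; on pandas >= 1.2 the same thing can be written directly with how='cross'. A minimal sketch with made-up sample data (the real inputs come from words(), which is not shown here):

```python
import pandas as pd

# hypothetical sample frames standing in for words() output
a = pd.DataFrame({'event': ['pump', 'valve']})
b = pd.DataFrame({'proj_desc': ['pump repair']})

# built-in cross join: every row of a paired with every row of b
ab = pd.merge(a, b, how='cross')
print(len(ab))  # 2 * 1 = 2 rows
```

This avoids adding and deleting the 'key' column by hand.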
The function is registered and invoked as follows:
part_concat = udf(dist, part_output)
part_output_df = gl_part_proj_df.select(
    part_concat(gl_part_proj_df['part_description'], gl_part_proj_df['proj_desc']).alias("part_output")
)
My question: since the UDF builds and manipulates pandas DataFrames internally, will it still execute on the worker nodes, or will the data be pulled back to the driver?