
Pyspark Join with levenshtein distance


I want to perform a join based on Levenshtein distance. I have 2 tables:
Data: a CSV file in HDFS. One of Data's columns is a disease description (15K rows).
df7_ct_map: a table I read from Hive. One of its columns is a disease indication (20K rows).
I am trying to join both tables by matching each description with the indication (they are text descriptions of sicknesses). Ideally they are identical, but when the texts differ I want to select the matching text containing the maximum number of common words.
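The "maximum of common words" criterion could be prototyped outside Spark first. A minimal sketch in plain Python (the disease strings below are hypothetical examples, not rows from the real tables):

```python
def common_words(a, b):
    # Count the distinct words two descriptions share (case-insensitive).
    return len(set(a.lower().split()) & set(b.lower().split()))

# Pick, for one description, the indication sharing the most words with it.
indications = ["diabetes mellitus type 2", "rheumatoid arthritis"]
best = max(indications,
           key=lambda ind: common_words("type 2 diabetes mellitus", ind))
```

This only captures the scoring idea; inside Spark the same logic would have to live in a UDF or be expressed with built-in column functions.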

 

from pyspark.sql.functions import levenshtein
joinedDF = df7_ct_map.join(Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
joinedDF.show(10)
joinedDF.show(10)

 

The problem is that Data is a DataFrame, which is why I get the following error:

 

TypeError Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)

TypeError: 'DataFrame' object is not callable

 

Any advice? Can I use the FuzzyWuzzy package, and if so, how?

Thanks a lot
