
Pyspark Join with levenshtein distance


I want to perform a join based on Levenshtein distance. I have two tables:

Data: a CSV file in HDFS. One of its columns is a disease description (~15K rows).
df7_ct_map: a table I read from Hive. One of its columns is a disease indication (~20K rows).

I am trying to join the two tables by matching each description to an indication (both are text descriptions of sicknesses). Ideally they should be identical, but when the texts differ I want to select the match that shares the maximum number of common words.

 

from pyspark.sql.functions import levenshtein 
joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), 
Data("Indication")) < 3)
joinedDF.show(10)

 

The problem is that Data is a DataFrame, which is why I get the following error:

 

TypeError                                 Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)

TypeError: 'DataFrame' object is not callable

 

Any advice? Can I use the fuzzywuzzy package, and if so, how?

Thanks a lot