01-25-2018 03:56 AM - last edited on 01-25-2018 08:10 AM by cjervis
I want to perform a join based on Levenshtein distance. I have 2 tables:
Data: a CSV file stored in HDFS. One of its columns is a disease description (15K rows).
df7_ct_map: a table I load from Hive. One of its columns is a disease Indication (20K rows).
I'm trying to join both tables by matching each description with the indication (both are text descriptions of sicknesses). Ideally the two texts would be identical, but when they differ I want to select the match that shares the most words.
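To make the "most common words" idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the matching rule, not the fuzzywuzzy API and not a Spark job; the helper names `words` and `best_match` and the sample strings are made up.

```python
import re
from typing import Iterable, Optional, Tuple


def words(text: str) -> set:
    """Lowercase a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(description: str, indications: Iterable[str]) -> Tuple[Optional[str], int]:
    """Return the indication sharing the most words with description, and that count."""
    target = words(description)
    best, best_score = None, 0
    for indication in indications:
        score = len(target & words(indication))
        # Strict '>' keeps the first candidate when several tie.
        if score > best_score:
            best, best_score = indication, score
    return best, best_score


# Made-up sample values, for illustration only.
indications = ["Type 2 diabetes mellitus", "Chronic heart failure", "Seasonal flu"]
print(best_match("diabetes mellitus type 2 in adults", indications))
# prints ('Type 2 diabetes mellitus', 4)
```

When nothing overlaps, the function returns `(None, 0)`, so unmatched descriptions can be told apart from weak matches.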
Here is what I tried:
from pyspark.sql.functions import levenshtein
joinedDF = df7_ct_map.join(Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
joinedDF.show(10)
The problem is that Data is a DataFrame, which is why I get the following error:
TypeError: 'DataFrame' object is not callable
TypeError Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
TypeError: 'DataFrame' object is not callable
Any advice? Can I use the fuzzywuzzy package, and if so, how?
Thanks a lot