I am working on filtering out duplicates and creating an auto-generated ID. Here is the code:
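# Drop fully identical rows, then expose the result to Spark SQL
df3 = df.distinct()
df3.createOrReplaceTempView('df3')
# Number the remaining rows to generate GeographyID
x = spark.sql('select row_number() over (order by ZipCode, District, Division, Region) as GeographyID, District, Division, Region, RegionName, ZipCode, City, State from df3')
x.show(50)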
After filtering out the duplicates, I still have a problem with the City column: the same city appears under different spellings, so I cannot get distinct city values.
Here is an example of the City data:
City
N. Plainfield
North Plainfield
How can I handle this kind of string value so that I get distinct values?
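For illustration, one possible direction is to derive a normalized key from each city name and deduplicate on that key instead of on the raw string. The sketch below is plain Python and rests on assumptions that are not part of the original code: the `ABBREVIATIONS` map and the `normalize_city` helper are hypothetical names, the map is a hand-maintained list that would need extending for real data, and every token that matches the map is expanded, which may be wrong for some names.

```python
import re

# Hypothetical abbreviation map (an assumption); extend it for the data at hand.
ABBREVIATIONS = {
    "n": "north",
    "s": "south",
    "e": "east",
    "w": "west",
    "st": "saint",
    "ft": "fort",
    "mt": "mount",
}


def normalize_city(name):
    """Return a canonical comparison key for a city name, or None if missing."""
    if name is None:
        return None
    # Lowercase, turn periods into spaces, and split on whitespace.
    tokens = re.sub(r"\.", " ", name.lower()).split()
    # Expand any token found in the abbreviation map; leave other tokens as-is.
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(expanded)


print(normalize_city("N. Plainfield"))
print(normalize_city("North Plainfield"))
```

Both sample values map to the same key, so they would collapse into one row. In Spark, a function like this could be wrapped with `pyspark.sql.functions.udf` to add a key column (for example a hypothetical `CityKey`), followed by `dropDuplicates` on the key columns rather than `distinct()` on the raw rows. Which original spelling of City survives in that case is not guaranteed.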