
How can I get DISTINCT values in PySpark when the same name is written with different prefixes?


I am filtering out duplicate rows and creating an auto-generated ID. Here is the code:

 

df3 = df.distinct()
df3.createOrReplaceTempView('df3')
x = spark.sql('select row_number() over (order by ZipCode, District, Division, Region) As GeographyID, District, Division, Region, RegionName, ZipCode, City, State from df3')

x.show(50)

 

After filtering the duplicates, I still have a problem with the City column: the same city appears under differently written names, so I cannot get the distinct city values.

Here is an example of the City data:

City

N. Plainfield   

North Plainfield

How can I deal with this kind of string value to get the distinct values?
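One possible approach, sketched here under assumptions rather than as a definitive fix, is to normalize the City strings before calling distinct(), since distinct() only removes rows whose values match exactly. The names normalize_city, dedupe_cities and DIRECTION_ABBREVIATIONS below are hypothetical, and the abbreviation map is an assumption that covers only the "N." style prefix shown in the example; real data may need more rules.

```python
import re

# Hypothetical abbreviation map (an assumption); extend it to match your data.
DIRECTION_ABBREVIATIONS = {
    "n": "North",
    "s": "South",
    "e": "East",
    "w": "West",
}


def normalize_city(city):
    """Trim whitespace and expand a leading direction abbreviation such as 'N.'."""
    if city is None:
        return None
    # Collapse runs of whitespace and strip leading/trailing spaces.
    cleaned = re.sub(r"\s+", " ", city).strip()
    # A single direction letter, an optional dot, then whitespace: "N. " or "N ".
    match = re.match(r"^([NSEWnsew])\.?\s+(.*)$", cleaned)
    if match:
        prefix = DIRECTION_ABBREVIATIONS[match.group(1).lower()]
        return f"{prefix} {match.group(2)}"
    return cleaned


def dedupe_cities(df):
    """Normalize the City column of a Spark DataFrame, then drop duplicate rows."""
    # Imported lazily so normalize_city can be tried out without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize_city, StringType())
    return df.withColumn("City", normalize_udf(F.col("City"))).distinct()
```

With this in place, df3 = dedupe_cities(df) could stand in for df.distinct() before the row_number() query. A Python UDF is slower than built-in column functions, so for a single known prefix something like F.regexp_replace(F.trim(F.col("City")), r"^N\.?\s+", "North ") may be a better fit. Note that the rule would also rewrite names that legitimately start with a single letter, so it is worth checking the affected rows first.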

 
