How can I get DISTINCT values when the same name starts with a different string, using PySpark?

I am working on filtering out duplicates and creating an auto-generated ID. Here is the code:

 

df3 = df.distinct()
df3.createOrReplaceTempView('df3')
x = spark.sql('select row_number() over (order by ZipCode, District, Division, Region) As GeographyID, District, Division, Region, RegionName, ZipCode, City, State from df3')

x.show(50)

 

After filtering out the duplicates, I still have a problem with the city data: the same city appears under different spellings, so I am not able to get the distinct city values.

Here is an example of the City data:

City

N. Plainfield   

North Plainfield

How can I deal with this kind of string value to get the distinct values?

 

1 REPLY

Hi @suri789 

 

I don't think you have shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.
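
In the meantime, here is a minimal sketch of one possible approach, based only on the two sample values above: normalize the city name into a comparison key before removing duplicates. The names ABBREVIATIONS, normalize_city, dedupe_by_city and CityKey are hypothetical, and the abbreviation map is an assumption that would need to be extended to match the real data. It may not fit your case once the full code and expected output are known.

```python
# Hypothetical abbreviation map (an assumption); extend it to match your data.
ABBREVIATIONS = {
    "n": "north",
    "s": "south",
    "e": "east",
    "w": "west",
}


def normalize_city(name):
    """Return a comparison key for a city name, or None for a null value."""
    if name is None:
        return None
    # Lowercase, drop periods, and split on whitespace (this also trims padding).
    tokens = name.lower().replace(".", "").split()
    # Expand only a leading abbreviation such as "N." into "north".
    if tokens and tokens[0] in ABBREVIATIONS:
        tokens[0] = ABBREVIATIONS[tokens[0]]
    return " ".join(tokens)


def dedupe_by_city(df):
    """Drop rows that differ only in how the City value is spelled."""
    # Imported here so normalize_city can be used and tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    city_key = F.udf(normalize_city, StringType())
    with_key = df.withColumn("CityKey", city_key(F.col("City")))
    # Compare every column except the raw City, using CityKey in its place.
    key_cols = [c for c in with_key.columns if c != "City"]
    return with_key.dropDuplicates(key_cols)
```

With this, df3 = dedupe_by_city(df) could stand in for df.distinct() in the code above. Note that dropDuplicates keeps one arbitrary row per key, so the surviving City spelling may be either "N. Plainfield" or "North Plainfield".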