Cloudera Community

suri789

Explorer

I am working on filtering out duplicates and creating an autogenerated ID. Here is the code:

df3 = df.distinct()
df3.createOrReplaceTempView('df3')
x = spark.sql('select row_number() over (order by ZipCode, District, Division, Region) as GeographyID, District, Division, Region, RegionName, ZipCode, City, State from df3')
x.show(50)

After filtering the duplicates, I still have a problem with the city data: the same city appears under different spellings, so I am not able to get the distinct city values.

Here is an example of the City data:

City

N. Plainfield

North Plainfield

How can I deal with this kind of string value to get the distinct values?

1 REPLY

RangaReddy

Master Collaborator

Hi @suri789

I think you haven't shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.
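
In the meantime, here is a rough sketch of one possible approach: normalize the city names before deduplicating, so that "N. Plainfield" and "North Plainfield" collapse to the same value. It is plain Python so the logic can be checked without a cluster. The `ABBREVIATIONS` map and the `normalize_city` helper are hypothetical names made up for illustration, and the sketch assumes that the only variation in your data is a leading one-letter directional abbreviation, which may not hold for your real data.

```python
import re

# Hypothetical abbreviation map (an assumption); extend it to match your data.
ABBREVIATIONS = {
    "n": "North",
    "s": "South",
    "e": "East",
    "w": "West",
}


def normalize_city(city):
    """Expand a leading one-letter directional abbreviation in a city name."""
    if city is None:
        return None
    # Trim the value and collapse repeated whitespace.
    cleaned = re.sub(r"\s+", " ", city.strip())
    # Match a single leading letter, an optional period, then the rest of the name.
    match = re.match(r"^([A-Za-z])\.?\s+(.+)$", cleaned)
    if match and match.group(1).lower() in ABBREVIATIONS:
        return ABBREVIATIONS[match.group(1).lower()] + " " + match.group(2)
    return cleaned


# The two sample values from the question.
cities = ["N. Plainfield", "North Plainfield"]
distinct_cities = sorted({normalize_city(c) for c in cities})
print(distinct_cities)
```

To use this in Spark, one option would be to wrap the helper with `pyspark.sql.functions.udf` (return type `StringType()`), overwrite the `City` column with `withColumn`, and only then call `distinct()`. For large data, a built-in such as `regexp_replace` would likely be faster than a Python UDF.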

How can I get the DISTINCT values with same name and starts with different string using Pyspark