How to remove spaces and dots and convert to lowercase in PySpark
- Labels:
-
Apache Spark
Created ‎06-09-2022 06:54 PM
I have a pyspark dataframe with names like
N. Plainfield
North Plainfield
West Home Land
NEWYORK
newyork
So. Plainfield
S. Plaindield
Some of them contain dots and spaces between initials and some do not. How can they be converted to:
n plainfield
north plainfield
west homeland
newyork
newyork
so plainfield
s plainfield
(with no dots, and a single space between the initial and the name)
I tried the following, but it only removes the dots and doesn't remove the extra spaces between initials:
from pyspark.sql.functions import regexp_replace
names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.", ""))
After removing the whitespace and dots, is there any way to get the distinct values, like this:
north plainfield
west homeland
newyork
so plainfield
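The full cleanup (strip dots, collapse runs of spaces, lowercase, then de-duplicate) can be sketched with the same regex patterns; shown here in plain Python with `re` so the patterns are easy to check standalone. In PySpark, the same patterns go into `regexp_replace`, plus `lower()` and `.distinct()`. (Merging "Home Land" into "homeland" would need extra custom logic not covered here.)

```python
import re

def clean_name(name):
    # Remove dots, e.g. "N. Plainfield" -> "N Plainfield"
    name = re.sub(r"\.", "", name)
    # Collapse runs of whitespace to a single space and trim the ends
    name = re.sub(r"\s+", " ", name).strip()
    # Lowercase so "NEWYORK" and "newyork" become identical
    return name.lower()

names = ["N. Plainfield", "North Plainfield", "NEWYORK", "newyork", "So. Plainfield"]
print(sorted({clean_name(n) for n in names}))
# -> ['n plainfield', 'newyork', 'north plainfield', 'so plainfield']
```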
Created ‎06-15-2022 10:41 AM
Hi @suri789, can you try the following and share your feedback?
>>> df.show()
+----------------+
|           value|
+----------------+
| N. Plainfield|
|North Plainfield|
| West Home Land|
| NEWYORK|
| newyork|
| So. Plainfield|
| S. Plaindield|
| s Plaindield|
|North Plainfield|
+----------------+
>>> from pyspark.sql.functions import regexp_replace, lower
>>> df_tmp=df.withColumn('value', regexp_replace('value', r'\.',''))
>>> df_tmp.withColumn('value', lower(df_tmp.value)).distinct().show()
+----------------+
|           value|
+----------------+
| s plaindield|
| n plainfield|
| west home land|
| newyork|
| so plainfield|
|north plainfield|
+----------------+
Created ‎06-30-2022 05:05 AM
Thanks jagadeesan,
but the output still contains duplicate values.
Created ‎06-30-2022 05:05 AM
"so plainfield" and "s plainfiled" are both the same.
Created ‎06-30-2022 07:27 AM
Hi @suri789, these two are different values; I don't see any duplicates here:
so plainfield
s plainfiled
Also, looking at the output, I don't see any duplicate values; all the rows are distinct:
+----------------+
|           value|
+----------------+
| s plaindield|
| n plainfield|
| west home land|
| newyork|
| so plainfield|
|north plainfield|
+----------------+
Please note: "n plainfield & north plainfield" and "s plainfield & so plainfield" are different values, because we didn't write any custom logic mapping 'n' to 'north' or 's' to 'so'.
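If an initial should map to its full word, that custom logic has to be written explicitly, for example by expanding a single-letter (or known abbreviated) first token from a small lookup table. A minimal sketch in plain Python (the EXPANSIONS mapping is hypothetical; the right table depends on your data):

```python
# Hypothetical expansions; the correct table depends on the data
EXPANSIONS = {"n": "north", "s": "so", "w": "west"}

def expand_initial(name):
    # Replace a known abbreviated first token with its expansion
    parts = name.split(" ", 1)
    if len(parts) == 2 and parts[0] in EXPANSIONS:
        return EXPANSIONS[parts[0]] + " " + parts[1]
    return name

print(expand_initial("n plainfield"))   # -> "north plainfield"
print(expand_initial("so plainfield"))  # unchanged: "so plainfield"
```

In PySpark this logic could be applied as a UDF, or built as a chain of `when`/`regexp_replace` expressions, before calling `.distinct()`.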
