<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to remove the space and dots and convert into lowercase in Pyspark in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346616#M234943</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/98475"&gt;@suri789&lt;/a&gt;, these two are different values; I don't see any duplicates here.&lt;/P&gt;&lt;PRE&gt;so plainfield&lt;/PRE&gt;&lt;PRE&gt;s plainfiled&lt;/PRE&gt;&lt;P&gt;Also, I don't see any duplicate values in the output; all rows are distinct.&lt;/P&gt;&lt;PRE&gt;+----------------+&lt;BR /&gt;|           value|&lt;BR /&gt;+----------------+&lt;BR /&gt;|    s plaindield|&lt;BR /&gt;|    n plainfield|&lt;BR /&gt;|  west home land|&lt;BR /&gt;|         newyork|&lt;BR /&gt;|   so plainfield|&lt;BR /&gt;|north plainfield|&lt;BR /&gt;+----------------+&lt;/PRE&gt;&lt;P&gt;Please note: "n plainfield" and "north plainfield", and likewise "s plainfield" and "so plainfield", are different values, because we didn't write any custom logic to treat 'n' as 'north' or 's' as 'so'.&lt;/P&gt;</description>
    <pubDate>Thu, 30 Jun 2022 14:27:06 GMT</pubDate>
    <dc:creator>jagadeesan</dc:creator>
    <dc:date>2022-06-30T14:27:06Z</dc:date>
    <item>
      <title>How to remove the space and dots and convert into lowercase in Pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/345327#M234487</link>
      <description>&lt;P&gt;I have a PySpark DataFrame with names like:&lt;/P&gt;&lt;P&gt;N. Plainfield&lt;BR /&gt;North Plainfield&lt;BR /&gt;West Home Land&lt;BR /&gt;NEWYORK&lt;BR /&gt;newyork&lt;BR /&gt;So. Plainfield&lt;BR /&gt;S. Plaindield&lt;/P&gt;&lt;P&gt;Some of them contain dots and spaces between initials and some do not. How can they be converted to:&lt;/P&gt;&lt;P&gt;n plainfield&lt;BR /&gt;north plainfield&lt;BR /&gt;west homeland&lt;BR /&gt;newyork&lt;BR /&gt;newyork&lt;BR /&gt;so plainfield&lt;BR /&gt;s plainfield&lt;/P&gt;&lt;P&gt;(with no dots, no spaces between initials, and one space between the initials and the name)&lt;/P&gt;&lt;P&gt;I tried the following, but it only removes the dots and doesn't remove the spaces between initials:&lt;/P&gt;&lt;P&gt;names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.",""))&lt;/P&gt;&lt;P&gt;After removing the whitespace and dots, is there any way to get the distinct values, like this?&lt;/P&gt;&lt;P&gt;north plainfield&lt;BR /&gt;west homeland&lt;BR /&gt;newyork&lt;BR /&gt;so plainfield&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jun 2022 01:54:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/345327#M234487</guid>
      <dc:creator>suri789</dc:creator>
      <dc:date>2022-06-10T01:54:05Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove the space and dots and convert into lowercase in Pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/345723#M234622</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/98475"&gt;@suri789&lt;/a&gt;, can you try the code below and share your feedback?&lt;/P&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; df.show()&lt;BR /&gt;+----------------+&lt;BR /&gt;|           value|&lt;BR /&gt;+----------------+&lt;BR /&gt;|   N. Plainfield|&lt;BR /&gt;|North Plainfield|&lt;BR /&gt;|  West Home Land|&lt;BR /&gt;|         NEWYORK|&lt;BR /&gt;|         newyork|&lt;BR /&gt;|  So. Plainfield|&lt;BR /&gt;|   S. Plaindield|&lt;BR /&gt;|    s Plaindield|&lt;BR /&gt;|North Plainfield|&lt;BR /&gt;+----------------+&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; from pyspark.sql.functions import regexp_replace, lower&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; # Remove every literal dot from the value column&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; df_tmp=df.withColumn('value', regexp_replace('value', r'\.',''))&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; # Lowercase the values, then keep only the distinct rows&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; df_tmp.withColumn('value', lower(df_tmp.value)).distinct().show()&lt;BR /&gt;+----------------+&lt;BR /&gt;|           value|&lt;BR /&gt;+----------------+&lt;BR /&gt;|    s plaindield|&lt;BR /&gt;|    n plainfield|&lt;BR /&gt;|  west home land|&lt;BR /&gt;|         newyork|&lt;BR /&gt;|   so plainfield|&lt;BR /&gt;|north plainfield|&lt;BR /&gt;+----------------+&lt;/PRE&gt;</description>
      <pubDate>Wed, 15 Jun 2022 17:41:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/345723#M234622</guid>
      <dc:creator>jagadeesan</dc:creator>
      <dc:date>2022-06-15T17:41:51Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove the space and dots and convert into lowercase in Pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346603#M234932</link>
      <description>&lt;P&gt;Thanks, jagadeesan.&lt;/P&gt;&lt;P&gt;But you are still getting duplicate values.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2022 12:05:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346603#M234932</guid>
      <dc:creator>suri789</dc:creator>
      <dc:date>2022-06-30T12:05:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove the space and dots and convert into lowercase in Pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346604#M234933</link>
      <description>&lt;PRE&gt;so plainfield and s plainfiled are the same&lt;/PRE&gt;</description>
      <pubDate>Thu, 30 Jun 2022 12:05:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346604#M234933</guid>
      <dc:creator>suri789</dc:creator>
      <dc:date>2022-06-30T12:05:59Z</dc:date>
    </item>
    <item>
      <title>Re: How to remove the space and dots and convert into lowercase in Pyspark</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346616#M234943</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/98475"&gt;@suri789&lt;/a&gt;, these two are different values; I don't see any duplicates here.&lt;/P&gt;&lt;PRE&gt;so plainfield&lt;/PRE&gt;&lt;PRE&gt;s plainfiled&lt;/PRE&gt;&lt;P&gt;Also, I don't see any duplicate values in the output; all rows are distinct.&lt;/P&gt;&lt;PRE&gt;+----------------+&lt;BR /&gt;|           value|&lt;BR /&gt;+----------------+&lt;BR /&gt;|    s plaindield|&lt;BR /&gt;|    n plainfield|&lt;BR /&gt;|  west home land|&lt;BR /&gt;|         newyork|&lt;BR /&gt;|   so plainfield|&lt;BR /&gt;|north plainfield|&lt;BR /&gt;+----------------+&lt;/PRE&gt;&lt;P&gt;Please note: "n plainfield" and "north plainfield", and likewise "s plainfield" and "so plainfield", are different values, because we didn't write any custom logic to treat 'n' as 'north' or 's' as 'so'.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2022 14:27:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-remove-the-space-and-dots-and-convert-into-lowercase/m-p/346616#M234943</guid>
      <dc:creator>jagadeesan</dc:creator>
      <dc:date>2022-06-30T14:27:06Z</dc:date>
    </item>
  </channel>
</rss>

