<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Find Fields in Noise with Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140509#M27876</link>
    <description>&lt;P&gt;We have 100 different spreadsheets in CSV format, each with 20 fields. The fields are roughly standard, but some people use First Name, some use Name or firstname, and some use a single name field. Some use M and F for gender; some use 0 and 1.&lt;/P&gt;&lt;P&gt;We want to convert all these CSV variants into one gold standard with standard field names, types, and ranges.&lt;/P&gt;</description>
    <pubDate>Tue, 10 May 2016 21:39:38 GMT</pubDate>
    <dc:creator>TimothySpann</dc:creator>
    <dc:date>2016-05-10T21:39:38Z</dc:date>
    <item>
      <title>Find Fields in Noise with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140509#M27876</link>
      <description>&lt;P&gt;We have 100 different spreadsheets in CSV format, each with 20 fields. The fields are roughly standard, but some people use First Name, some use Name or firstname, and some use a single name field. Some use M and F for gender; some use 0 and 1.&lt;/P&gt;&lt;P&gt;We want to convert all these CSV variants into one gold standard with standard field names, types, and ranges.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 21:39:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140509#M27876</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-05-10T21:39:38Z</dc:date>
    </item>
    <item>
      <title>Re: Find Fields in Noise with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140510#M27877</link>
      <description>&lt;P&gt;That is essentially master data management. There are a ton of tools out there for this (IBM MDM alone has three solutions; QualityStage also comes to mind).&lt;/P&gt;&lt;P&gt;Some of it may be easy. For the gender fields, for example, you could write simple Scala UDFs that do the transformation. Today you may want to use DataFrames, although I am still a fan of old-fashioned RDDs. The article linked below is an example that does parsing with a Scala UDF; you could do your cleaning in there as well. This works whenever you can check a row based on the row alone and do not need full person or entity matching.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html" target="_blank"&gt;https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The moment you need full entity matching rather than simple data cleansing, it all gets MUCH more complicated. Here is a great answer by Henning on that topic (and a less good answer with additional details from me).&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/26849/person-matching-in-spark.html" target="_blank"&gt;https://community.hortonworks.com/questions/26849/person-matching-in-spark.html&lt;/A&gt;&lt;/P&gt;</description>
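      <content:encoded><![CDATA[
<P>The row-level cleaning described in this reply could look roughly like the sketch below. It is not from the original reply: it is plain Python rather than a Scala UDF, and the alias table, the function names, and especially the meaning assigned to 0 and 1 are hypothetical assumptions that would have to be confirmed for each source file.</P>

```python
import re

# Hypothetical alias table: normalized header spelling -> gold-standard name.
HEADER_ALIASES = {
    "firstname": "first_name",
    "gender": "gender",
    "sex": "gender",
}

# Hypothetical code table. Which of 0/1 means M or F is an ASSUMPTION here;
# the thread does not say, so check it per source file.
GENDER_CODES = {
    "m": "M", "male": "M", "0": "M",
    "f": "F", "female": "F", "1": "F",
}


def normalize_header(raw):
    """Lowercase a header, drop non-alphanumerics, then apply the alias table."""
    key = re.sub(r"[^a-z0-9]", "", raw.lower())
    return HEADER_ALIASES.get(key, key)


def normalize_gender(raw):
    """Return 'M' or 'F', or None when the value is missing or unrecognized."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


def clean_row(row):
    """Clean one record (dict of header -> value) using only that record."""
    cleaned = {normalize_header(k): v for k, v in row.items()}
    if "gender" in cleaned:
        cleaned["gender"] = normalize_gender(cleaned["gender"])
    return cleaned


if __name__ == "__main__":
    print(clean_row({"First Name": "Ann", "Sex": "1"}))
```

<P>Because each function looks only at a single value or row, the same logic could be wrapped as a UDF (in PySpark, via pyspark.sql.functions.udf) and applied per column. It does not attempt entity matching.</P>
]]></content:encoded>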
      <pubDate>Tue, 10 May 2016 22:40:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140510#M27877</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T22:40:16Z</dc:date>
    </item>
    <item>
      <title>Re: Find Fields in Noise with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140511#M27878</link>
      <description>&lt;P&gt;I am wondering about a full open source solution for Master Data Management.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 22:45:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140511#M27878</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-05-10T22:45:40Z</dc:date>
    </item>
    <item>
      <title>Re: Find Fields in Noise with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140512#M27879</link>
      <description>&lt;P&gt;Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community (mural/mosaic), but the last update in the repository seems to have been 4 years ago, so I am not sure how useful that is.&lt;/P&gt;&lt;P&gt;&lt;A href="https://java.net/projects/mosaic" target="_blank"&gt;https://java.net/projects/mosaic&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 22:53:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Find-Fields-in-Noise-with-Spark/m-p/140512#M27879</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T22:53:52Z</dc:date>
    </item>
  </channel>
</rss>

