
Find Fields in Noise with Spark

TimothySpann — Tue, 10 May 2016 21:39:38 GMT

So we have 100 different spreadsheets in CSV format with 20 fields. The fields are kind of standard, but some people use First Name, some use Name or firstname, some use one name field. Some use M and F for gender; some use 0 and 1.

We want to convert all these types of CSVs into one gold standard with standard field names, types, and ranges.

Re: Find Fields in Noise with Spark

bleonhardi — Tue, 10 May 2016 22:40:16 GMT

That is essentially master data management. There are a ton of tools out there for this (IBM MDM alone has three solutions, and QualityStage also comes to mind).

Some of it may be easy. For the gender fields, for example, you could write simple Scala UDFs that do the transformation. Today you may want to use DataFrames, although I am still a fan of old-fashioned RDDs. The article linked below has an example that does parsing with a Scala UDF; you could do your cleaning in there as well. This works whenever you can check a row based on that row alone and do not need full person or entity matching.

https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
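
As a rough illustration of the row-level cleaning described above (not taken from the linked article), here is a minimal sketch in plain Python. The alias table, the value table, and all function names are hypothetical. In particular, which of 0 and 1 means male or female is an assumption that would have to be confirmed per source file.

```python
import csv
import io
import re

# Hypothetical alias table: canonical field name -> header spellings,
# compared after lowercasing and removing spaces/underscores.
# An ambiguous header such as "Name" is deliberately left out, since it may
# hold a first name or a full name depending on the source.
FIELD_ALIASES = {
    "first_name": {"firstname", "fname"},
    "gender": {"gender", "sex"},
}

# Hypothetical value table. ASSUMPTION: 0 means male and 1 means female;
# the thread does not say, so check this for every source.
GENDER_VALUES = {"m": "M", "male": "M", "0": "M",
                 "f": "F", "female": "F", "1": "F"}


def canonical_field(header):
    """Map a raw CSV header to its canonical name, or keep it if unknown."""
    key = re.sub(r"[\s_]+", "", header.lower())
    for canonical, aliases in FIELD_ALIASES.items():
        if key in aliases:
            return canonical
    return header.strip()  # unknown header: keep it for manual review


def normalize_gender(value):
    """Return 'M' or 'F', or None when the value is not recognized."""
    if value is None:
        return None
    return GENDER_VALUES.get(str(value).strip().lower())


def clean_row(row):
    """Clean one row using only that row (no cross-row entity matching)."""
    cleaned = {canonical_field(k): v for k, v in row.items()}
    if "gender" in cleaned:
        cleaned["gender"] = normalize_gender(cleaned["gender"])
    return cleaned


def clean_csv(text):
    """Parse CSV text and clean every row."""
    return [clean_row(r) for r in csv.DictReader(io.StringIO(text))]
```

Because `clean_row` looks at one row at a time, the same logic could be wrapped in a Spark UDF (in Scala as suggested above, or in Python) or applied in a map over an RDD. Anything that needs to compare rows with each other falls outside this sketch.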

The moment you need full entity matching rather than simple data cleansing, it all gets MUCH more complicated. Here is a great answer by Henning on that topic (and a less good answer with additional details from me):

https://community.hortonworks.com/questions/26849/person-matching-in-spark.html

Re: Find Fields in Noise with Spark

TimothySpann — Tue, 10 May 2016 22:45:40 GMT

I am wondering about a full open source solution for Master Data Management.

Re: Find Fields in Noise with Spark

bleonhardi — Tue, 10 May 2016 22:53:52 GMT

Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community (mural/mosaic), but the last update in the repository seems to have been 4 years ago, so I am not sure how useful that is.

https://java.net/projects/mosaic