Find Fields in Noise with Spark

Master Guru

We have 100 different spreadsheets in CSV format, each with 20 fields. The fields are more or less standard, but the names vary: some people use First Name, some use Name or firstname, and some use a single name field. Some use M and F for gender; others use 0 and 1.

We want to convert all of these CSV variants into one gold standard with standard field names, types, and ranges.

1 ACCEPTED SOLUTION

Master Guru

That is essentially master data management. There are a ton of tools out there for this (IBM MDM alone has three solutions, and QualityStage also comes to mind).

Some of these problems may be easy. For the gender fields, for example, you could write simple Scala UDFs that do the transformation. Today you may want to use DataFrames, although I am still a fan of old-fashioned RDDs. Below is an example that does parsing using a Scala UDF; you could do your cleaning in there as well. This works whenever you can check a row based on the row alone and do not need full person or entity matching.

https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
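To make the row-level case concrete, here is a minimal sketch of the kind of per-row cleaning described above. It is plain Python rather than a Scala UDF so that it runs without a Spark cluster; the same functions could be wrapped in a UDF or applied in a map over an RDD. The alias table, the function names, and especially the meaning of the 0 and 1 gender codes are assumptions for illustration and would have to be confirmed for each source file.

```python
import csv
import io

# Hypothetical alias table: canonical field name -> header spellings seen in the files.
# A single combined name field would need extra splitting logic that is not shown here.
HEADER_ALIASES = {
    "first_name": {"first name", "firstname", "first_name", "name"},
    "gender": {"gender", "sex"},
}

# ASSUMPTION: the thread does not say which of 0 and 1 means what,
# so this mapping must be checked per source before use.
GENDER_CODES = {"m": "M", "male": "M", "0": "M", "f": "F", "female": "F", "1": "F"}


def canonical_header(raw):
    """Map one raw column header to its canonical name, or None if unknown."""
    key = raw.strip().lower()
    for canonical, aliases in HEADER_ALIASES.items():
        if key in aliases:
            return canonical
    return None


def normalize_gender(raw):
    """Return 'M', 'F', or None for values that cannot be interpreted."""
    return GENDER_CODES.get(raw.strip().lower())


def clean_rows(csv_text):
    """Yield dicts keyed by canonical field names, using only per-row logic."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned = {}
        for raw_name, value in row.items():
            if raw_name is None:
                continue  # extra cells beyond the header row
            name = canonical_header(raw_name)
            if name is None:
                continue  # unknown column: skipped here, could be logged instead
            value = (value or "").strip()
            cleaned[name] = normalize_gender(value) if name == "gender" else value
        yield cleaned
```

Calling `list(clean_rows(text))` on the contents of one file gives rows in the target layout. Every decision here depends only on the current row, which is what makes this kind of cleaning easy to parallelize.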

The moment you need full entity matching rather than just some data cleansing, it all gets MUCH more complicated. Here is a great answer by Henning on that topic (and a less good answer with additional details from me):

https://community.hortonworks.com/questions/26849/person-matching-in-spark.html
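To illustrate why matching is harder than cleansing, here is a rough sketch of a naive pairwise comparison. It is not the approach from the linked thread; it uses Python's standard `difflib` for string similarity, and the `name` field, the function names, and the 0.85 threshold are all assumptions picked for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def candidate_pairs(records, threshold=0.85):
    """Return index pairs of records whose names look alike.

    Every record is compared with every other record, so the number of
    comparisons grows quadratically with the row count.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a["name"], b["name"]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Unlike the row-level cleaning, this needs to see other rows to decide anything about one row, and a single similarity score on a single field is rarely enough to say two records are the same person.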


REPLIES

Master Guru

I am wondering about a full open source solution for Master Data Management.

Master Guru

Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community, such as Mural/Mosaic, but the last update in the repository seems to have been 4 years ago, so I am not sure how useful that is.

https://java.net/projects/mosaic