Created 05-10-2016 02:39 PM
So we have 100 different spreadsheets in CSV format, each with 20 fields. The fields are roughly standard, but some people use "First Name", some use "Name" or "firstname", and some use a single name field. Some use M and F for gender; others use 0 and 1.
We want to convert all of these CSV variants into one gold standard with standardized field names, types, and ranges.
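The header-standardization part of this can be sketched with a simple lookup table. This is a minimal illustration, not a Spark job: the mapping entries are assumptions based on the variants mentioned above, and you would extend it with whatever your 100 files actually use.

```python
import csv
import io

# Hypothetical mapping from the header variants mentioned above to one
# gold-standard field name; extend it with the variants in your files.
HEADER_MAP = {
    "first name": "first_name",
    "firstname": "first_name",
    "name": "first_name",
    "gender": "gender",
}

def standardize_headers(text):
    """Re-key each CSV row to the standard field names.

    Headers are lowercased and stripped before lookup; unknown headers
    fall through unchanged so nothing is silently dropped.
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {HEADER_MAP.get(k.strip().lower(), k.strip().lower()): v
               for k, v in row.items()}

rows = list(standardize_headers("First Name,Gender\nAlice,F\n"))
print(rows)  # → [{'first_name': 'Alice', 'gender': 'F'}]
```

The same dictionary-driven approach scales to all 20 fields, and keeping the mapping as data (rather than code) makes it easy to review as new source files arrive.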
Created 05-10-2016 03:40 PM
That is essentially master data management. There are a ton of tools out there for this (IBM MDM alone has three solutions; QualityStage also comes to mind).
Some of the transformations may be easy: for the gender fields, for example, you could write simple Scala UDFs that do the conversion. Today you may want to use DataFrames, although I am still a fan of old-fashioned RDDs. Below is an example that does parsing with a Scala UDF; you could do your cleaning in there as well. This approach works whenever you can fix a row based on the row alone and do not need full person or entity matching.
https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
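The row-local cleaning described above can be sketched as a plain function that you would then register as a UDF. The linked example uses Scala; this is a Python sketch, and the 0/1-to-M/F mapping is an assumption you would confirm per source file.

```python
def clean_gender(raw):
    """Map the gender encodings seen in the source files to one standard.

    Row-local by design: it needs nothing beyond the single value, which
    is exactly the case where a simple UDF suffices (no entity matching).
    """
    if raw is None:
        return None
    value = str(raw).strip().lower()
    # Assumed mapping: check which of 0/1 means M/F in each source file.
    mapping = {"m": "M", "male": "M", "0": "M",
               "f": "F", "female": "F", "1": "F"}
    return mapping.get(value)  # None flags an unmappable value for review

# In PySpark this would be wrapped as a UDF, roughly:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   clean_gender_udf = udf(clean_gender, StringType())
#   df = df.withColumn("gender", clean_gender_udf(df["gender"]))
print(clean_gender(" F "))  # → F
print(clean_gender("2"))    # → None
```

Returning None for unrecognized values (instead of guessing) lets you filter the bad rows into a quarantine table and inspect them separately.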
The moment you need not just data cleansing but full entity matching, it all gets MUCH more complicated. Here is a great answer by Henning on that topic (and a less good answer with additional details from me):
https://community.hortonworks.com/questions/26849/person-matching-in-spark.html
Created 05-10-2016 03:45 PM
I am wondering whether there is a fully open-source solution for Master Data Management.
Created 05-10-2016 03:53 PM
Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community (Mural/Mosaic), but the last update in the repository seems to have been four years ago, so I am not sure how useful that is.