I have a use case where I have two structured datasets from different companies with person details like first name, last name, DOB, email, postcode, date, etc. I don't have an ID to match the person details in the two datasets. My plan is to use Spark DataFrames with regex matching to join the two datasets and find the people present in both; since I don't have an ID to join on, I have to start with first name, last name, DOB, email, postcode and date. Please suggest any other business logic if it's more effective. Thank you.
Take care: you don't want to run a regex over your entire smallest dataset for each record of your largest dataset. That would not be optimised.
If you feel confident about the spelling of your first names and last names, maybe you can do a join (a broadcast join, if the smaller dataset is below spark.sql.autoBroadcastJoinThreshold) using the tuple (firstname, lastname) as the join key, and then run a final regex check on the joined rows to filter on the more complex columns (email, DOB, etc.). The idea of the JOIN is to reduce as much as possible the number of regex comparisons you have to make.
So, your final query could be something similar to:
sqlContext.sql("SELECT ds1.*, ds2.* FROM ds1 JOIN ds2 ON (ds1.firstname = ds2.firstname AND ds1.lastname = ds2.lastname) WHERE myFilterRegexUDF(ds1.email, ds2.email, ds1.dob, ds2.dob)")
Depending on the complexity of your filter, you might not need to write a UDF for your WHERE condition; you may be able to express it in standard SQL instead.
Yes, you can use a BroadcastHashJoin in Spark, and in most cases I would recommend it over a ShuffleJoin.
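The join-then-filter idea can be sketched in plain Python (an in-memory stand-in for the Spark query; the sample records and the email-normalisation rule are made-up assumptions):

```python
import re

# Made-up sample records from the two datasets (no shared ID).
ds1 = [{"firstname": "Anna", "lastname": "Smith", "email": "anna.smith@gmail.com"}]
ds2 = [{"firstname": "Anna", "lastname": "Smith", "email": "anna.smith@googlemail.com"},
       {"firstname": "Anna", "lastname": "Smith", "email": "other@example.com"}]

# Step 1: cheap equi-join on (firstname, lastname), which is what the
# Spark JOIN does; it drastically cuts the number of candidate pairs.
index = {}
for r in ds2:
    index.setdefault((r["firstname"], r["lastname"]), []).append(r)

def emails_match(e1, e2):
    # Treat googlemail.com and gmail.com as the same domain.
    def norm(e):
        return re.sub(r"@googlemail\.com$", "@gmail.com", e)
    return norm(e1) == norm(e2)

# Step 2: run the expensive check (here an email comparison) only on
# the pairs that survived the join, like the UDF in the WHERE clause.
matches = [(a, b)
           for a in ds1
           for b in index.get((a["firstname"], a["lastname"]), [])
           if emails_match(a["email"], b["email"])]
```

Without the equi-join step, the email check would have to run on every cross pair.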
There are a lot of commercial tools that provide person matching:
They normally compute a similarity score for a list of identified persons, awarding scoring points for:
- A fuzzy match of first name and last name using something like Levenshtein distance or phonetic matching (with special rules for middle names etc.)
- A match on some aspects of the address, especially the postcode
- A direct match on the email address, perhaps with special rules like gmail = googlemail
And they do this using a join between the datasets, normally with a pretty complex UDF or matching function. Make sure you have special, faster rules for perfectly matching names.
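A minimal sketch of such a scoring function, using the stdlib difflib.SequenceMatcher as a crude stand-in for a proper fuzzy matcher (the field weights and the gmail rule are made-up assumptions to be tuned on real data):

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def person_score(p1, p2):
    # Fast path: perfectly matching names need no fuzzy computation.
    if (p1["firstname"], p1["lastname"]) == (p2["firstname"], p2["lastname"]):
        name_score = 1.0
    else:
        name_score = 0.5 * fuzzy(p1["firstname"], p2["firstname"]) \
                   + 0.5 * fuzzy(p1["lastname"], p2["lastname"])
    # Special rule: googlemail and gmail are the same provider.
    e1 = p1["email"].lower().replace("@googlemail.", "@gmail.")
    e2 = p2["email"].lower().replace("@googlemail.", "@gmail.")
    email_score = 1.0 if e1 == e2 else 0.0
    postcode_score = 1.0 if p1["postcode"] == p2["postcode"] else 0.0
    # Made-up weights; tune them against your own data.
    return 0.5 * name_score + 0.3 * email_score + 0.2 * postcode_score
```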
Some nice links:
You normally do not want to run your complex matching function on every pair of entities, because that is very expensive. So you often group by some feature that clearly separates two persons. For example, if you know that people from different countries will never be matched with each other, you can group by country and only compare persons within one country, although you would not catch expats that way. Alternatively, you could decide that the first letter of the last name will never be miswritten, but then you wouldn't catch people who marry and change their name. So think of a logic that works for your data.
That one has a very good explanation of fuzzy matching and how to make it faster by using a cheap computation to rule out entities that have a minimum distance (i.e. are very dissimilar).
If you want to go phonetic, you can reduce names to their phonetic minimum and just test the codes for equality, but that may cause problems if the phonetic reduction algorithm is not perfect:
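For illustration, a minimal sketch of one classic phonetic reduction, American Soundex (real toolkits offer more robust algorithms such as Double Metaphone):

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    def code_of(c):
        for group, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
            if c in group:
                return d
        return None  # vowels, y, h and w have no code

    letters = [c for c in name.lower() if c.isalpha()]
    result = letters[0].upper()
    prev = code_of(letters[0])
    for c in letters[1:]:
        if c in "hw":
            continue              # h and w do not separate equal codes
        d = code_of(c)
        if d is None:             # a vowel resets the previous code
            prev = None
            continue
        if d != prev:
            result += d
            prev = d
        if len(result) == 4:
            break
    return (result + "000")[:4]
```

Robert and Rupert reduce to the same code, which is exactly the point; the flip side is that unrelated names can also collide.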
So you see this is a very big topic.
You can also buy premade tools; BigMatch for Hadoop from IBM uses some of their Master Data Management technologies and is available on HDP.
This is really a broad topic.
The general complexity of this problem is N×M and can grow large quite quickly: with two datasets of 1,000 users each you already need 1,000 × 1,000 = 1,000,000 comparisons.
To reduce the complexity, typically a blocking key is chosen. A good blocking key in your case could be the age, the first letter of the name, the email domain, some combination of them, or some more advanced vector-space approach.
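A minimal sketch of blocking in Python; the blocking key chosen here (first letter of the last name plus the email domain) is just an example assumption:

```python
from collections import defaultdict
from itertools import product

def blocking_key(person):
    # Example blocking key: first letter of last name + email domain.
    return (person["lastname"][0].lower(), person["email"].split("@")[1])

def candidate_pairs(ds1, ds2):
    """Yield only the record pairs that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for p in ds1:
        blocks[blocking_key(p)][0].append(p)
    for p in ds2:
        blocks[blocking_key(p)][1].append(p)
    for left, right in blocks.values():
        yield from product(left, right)
```

The expensive matching function then only runs on these candidate pairs instead of the full N×M cross product.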
This is a learning-based approach and can be done in an unsupervised or supervised manner. For supervision you would need a training set, which I assume you don't have.
Probably the most common approach in this case is a decision tree, something like: first_name similarity ≥ 0.9 AND email similarity ≥ 0.8 => match.
3. Difference between matching and mapping
You will generate at worst N×M matches, but what you want is a mapping user:B -> user:A, so you need to pick the match with the highest score for each user. There is not a single matching.
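Turning candidate matches into a one-to-one mapping can be sketched as a greedy best-match assignment (the input scores are assumed to come from whatever matching function you use):

```python
def best_mapping(scored_matches):
    """scored_matches: iterable of (user_b, user_a, score) tuples.
    Returns a user_b -> user_a mapping that keeps only the
    highest-scoring match per user_b and never reuses a user_a."""
    taken_a = set()
    mapping = {}
    # Highest scores first, so each user gets its best available partner.
    for b, a, score in sorted(scored_matches, key=lambda t: -t[2]):
        if b not in mapping and a not in taken_a:
            mapping[b] = a
            taken_a.add(a)
    return mapping
```

Greedy assignment is a simple heuristic; an optimal assignment would need something like the Hungarian algorithm.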
4. String similarities
There are three types, each with advantages and disadvantages:
4.1 Character-based
Levenshtein, Jaro-Winkler, ... Basically they count how many steps are needed to turn one string into another. These are good, but have a disadvantage when one string contains extra information. For example "Hans Maier" vs. "Hans P. S. Maier": let's assume they are the same person, just that one has the abbreviations of his middle names in it. Character-based similarities punish this more than, for example, vector-based approaches do.
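A minimal sketch of the Levenshtein edit distance; note how the middle-name abbreviations in the "Hans Maier" example cost six whole insertions even though it is the same person:

```python
def levenshtein(a, b):
    """Number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]
```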
4.2 Vector-based approach
Take the string as a vector (H, A, N, S) vs. (H, E, N, S) and use cosine similarity to measure the distance in the vector space. These don't punish "edit distance" as hard, but depending on the approach they could potentially treat non-equal strings as equal. More advanced is to use the tf-idf score of each character to weight the characters. Additionally, you can build n-gram vectors of the names ((_H), (HA), (AN), (NS), (S_)) and use tf-idf on those; this can lead to good results as well.
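A minimal sketch of the n-gram variant: character bigram vectors compared with cosine similarity (plain counts here, no tf-idf weighting):

```python
from collections import Counter
from math import sqrt

def bigrams(s):
    s = "_" + s.lower() + "_"   # pad so first/last letters form bigrams too
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_sim(a, b):
    """Cosine similarity between the bigram count vectors of a and b."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

The extra middle-name bigrams in "Hans P. S. Maier" only dilute the score a little, instead of costing one edit each as in the character-based approach.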
4.3 Training-based string similarity approaches
Recent research has shown that training-based approaches lead to better results. Basically they work like the character-based approaches, except that you train how heavily each transition is punished. So the transition 'P.' -> '' in a name could be trained to be punished very little, since it is just the abbreviation of a middle name. This works very well, but also requires a training set.
What would I do if I were you?
You need to do exploratory analysis of your data. Email is already a very good key, so check how much of the data would already match using email alone, and how many records actually have a first name, last name, email, ... populated. You need to understand your data better first, and based on that, consider the approaches discussed above.
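Such an exploratory check can be sketched in a few lines of Python (the field names and the two metrics, coverage and exact-match rate, are illustrative assumptions):

```python
def field_coverage(records, fields):
    """Fraction of records with a non-empty value, per field."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f)) / n for f in fields}

def exact_match_rate(ds1, ds2, field):
    """Fraction of ds1 records whose field value also occurs in ds2."""
    values2 = {r[field].lower() for r in ds2 if r.get(field)}
    hits = sum(1 for r in ds1 if r.get(field) and r[field].lower() in values2)
    return hits / len(ds1)
```

If the email exact-match rate is already high, the expensive fuzzy machinery is only needed for the remainder.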