Please suggest the right approach to solve this scenario.
I have a DB2 table, member, that stores all member details along with address details (columns such as street_address_1, street_address_2, Zip_code, state, country). I now have to find the members who are living at the same address and treat them as members of the same household.
The problem is that the address is free text, and members do not enter it in a consistent format. One member gave the address as “ABC Apartment , Flat 303, 12 , New York City”, whereas another gave the same address as “ABC Appt. , Flat 303, 12 , NYC ”.
My logic has to treat both as the same address, so that these two members are considered to belong to the same household.
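To illustrate what I mean by treating the two as the same, here is a minimal Python sketch that normalizes both example addresses before comparing them. The abbreviation table and the `normalize_address` function are hypothetical names of my own, not part of any library, and a real table would need far more entries than these.

```python
import re

# Hypothetical abbreviation table (assumption); real data would need many more entries.
ABBREVIATIONS = {
    "appt": "apartment",
    "apt": "apartment",
    "nyc": "new york city",
}

def normalize_address(raw: str) -> str:
    # Lowercase and replace every run of non-alphanumeric characters with one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", raw.lower())
    # Expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)

a = normalize_address("ABC Apartment , Flat 303, 12 , New York City")
b = normalize_address("ABC Appt. , Flat 303, 12 , NYC ")
print(a == b)
```

With this table the two example strings normalize to the same value, but any abbreviation missing from the table would still need some form of fuzzy matching.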
The DB2 member table has around 10M rows. I am thinking of using Spark + Scala with a Soundex API. I guess I could also use Elasticsearch with group by + fuzzy matching, but I am not sure how that would work when the data is in a DB2 table, as I have never worked with Elasticsearch.
In the Spark-based approach I would also have to dump the 10M rows into a Hadoop environment and compare them one by one using Soundex-encoded values, which I think would be very time-consuming and not a smart way to approach this scenario. As far as I know, Spark does not have direct support for fuzzy matching either.
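One idea that might avoid comparing every row with every other row is to compare only members who share a Zip_code. Below is a rough Python sketch of that, assuming the zip codes themselves are reliable. It uses the standard-library `difflib` as a stand-in for whatever fuzzy matcher is eventually chosen. The tuple layout, the `household_pairs` function, and the 0.8 threshold are my own assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def household_pairs(members, threshold=0.8):
    """members: iterable of (member_id, zip_code, address) tuples (hypothetical layout)."""
    # Blocking step: bucket members by zip code so that only members
    # in the same bucket are ever compared with each other.
    blocks = defaultdict(list)
    for member_id, zip_code, address in members:
        blocks[zip_code].append((member_id, address.lower()))

    pairs = []
    for bucket in blocks.values():
        for (id_a, addr_a), (id_b, addr_b) in combinations(bucket, 2):
            # SequenceMatcher.ratio() returns a similarity score between 0 and 1.
            if SequenceMatcher(None, addr_a, addr_b).ratio() >= threshold:
                pairs.append((id_a, id_b))
    return pairs

# Made-up sample rows, not real data.
sample = [
    (1, "10001", "12 Main Street, Flat 303"),
    (2, "10001", "12 Main Stret, Flat 303"),
    (3, "10001", "99 Ocean Avenue"),
    (4, "94105", "12 Main Street, Flat 303"),
]
print(household_pairs(sample))
```

In this sample only members 1 and 2 are paired. Member 4 has the same street text but a different zip code, so it is never compared with the others. The cost then depends on the size of the largest zip-code bucket rather than on the full 10M rows.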
Can anyone suggest the right approach for this scenario, along with the process?