Created 12-04-2017 05:53 PM
Hello!
I have been stuck on this issue for a few days and am at a complete standstill as to what I can do next. Here is a description of my problem.
There are four tables imported into Hive with multiple data points. Unfortunately, the only join-able key is employee names. The problem with this key is that the names are spelled differently/formatted differently across data sources. I downloaded a fuzzy match UDF, but I'm having a hard time LEFT OUTER JOINING the tables onto each other using the fuzzy match UDF because the results are non-equality.
I'm looking for suggestions on how to join these tables onto each other in HIVE without inner joining. I would do this on the server side sql, but these tables are from different data sources/databases.
Thanks for the help!
Created 12-04-2017 06:30 PM
typical pattern is: select * from table_1 left join table_2 on (lcase(column_x) = lcase(column_y))
can you produce a canonical format on both sides (like lcase above) and join on that?
Created 12-04-2017 07:08 PM
Thanks for the reply! Unfortunately, that will not work in this case because the names are spelled differently. Example:
Table 1 Name = Tyler LeMartin
Table 2 Name = Tyler Le-Martin
Table 3 Name = Tyler Martin
I'd need to join the tables on to each based on the similiarity being >= 95%
Created 12-04-2017 11:09 PM
I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can do it like this probably:
select ... from table_1, table_2 where fuzzy_match(column_x, column_y) (inner join branch)
union all
select ... from table_1 where not exists (select 1 from table_2 where fuzzy_match(column_x, column_y)) s1; (left outer branch)
With this: https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single node cross products, but it will still be an expensive operation.