I have been stuck on this issue for a few days and am at a complete standstill as to what I can do next. Here is a description of my problem.
There are four tables imported into Hive with multiple data points. Unfortunately, the only join-able key is employee names. The problem with this key is that the names are spelled differently/formatted differently across data sources. I downloaded a fuzzy match UDF, but I'm having a hard time LEFT OUTER JOINING the tables onto each other using the fuzzy match UDF because the results are non-equality.
I'm looking for suggestions on how to join these tables onto each other in HIVE without inner joining. I would do this on the server side sql, but these tables are from different data sources/databases.
Thanks for the help!
typical pattern is: select * from table_1 left join table_2 on (lcase(column_x) = lcase(column_y))
can you produce a canonical format on both sides (like lcase above) and join on that?
Thanks for the reply! Unfortunately, that will not work in this case because the names are spelled differently. Example:
Table 1 Name = Tyler LeMartin
Table 2 Name = Tyler Le-Martin
Table 3 Name = Tyler Martin
I'd need to join the tables on to each based on the similiarity being >= 95%
I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can do it like this probably:
select ... from table_1, table_2 where fuzzy_match(column_x, column_y) (inner join branch)
select ... from table_1 where not exists (select 1 from table_2 where fuzzy_match(column_x, column_y)) s1; (left outer branch)
With this: https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single node cross products, but it will still be an expensive operation.