Support Questions

Find answers, ask questions, and share your expertise

Fuzzy Match Join Multiple Tables

avatar

Hello!

I have been stuck on this issue for a few days and am at a complete standstill as to what I can do next. Here is a description of my problem.

There are four tables imported into Hive with multiple data points. Unfortunately, the only join-able key is employee names. The problem with this key is that the names are spelled differently/formatted differently across data sources. I downloaded a fuzzy match UDF, but I'm having a hard time LEFT OUTER JOINING the tables onto each other using the fuzzy match UDF because the results are non-equality.

I'm looking for suggestions on how to join these tables onto each other in HIVE without inner joining. I would do this on the server side sql, but these tables are from different data sources/databases.

Thanks for the help!

3 REPLIES 3

avatar
Contributor

typical pattern is: select * from table_1 left join table_2 on (lcase(column_x) = lcase(column_y))

can you produce a canonical format on both sides (like lcase above) and join on that?

avatar

Thanks for the reply! Unfortunately, that will not work in this case because the names are spelled differently. Example:

Table 1 Name = Tyler LeMartin

Table 2 Name = Tyler Le-Martin

Table 3 Name = Tyler Martin

I'd need to join the tables on to each based on the similiarity being >= 95%

avatar
Contributor

I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can do it like this probably:

select ... from table_1, table_2 where fuzzy_match(column_x, column_y) (inner join branch)

union all

select ... from table_1 where not exists (select 1 from table_2 where fuzzy_match(column_x, column_y)) s1; (left outer branch)

With this: https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single node cross products, but it will still be an expensive operation.