Support Questions
Find answers, ask questions, and share your expertise

Fuzzy Match Join Multiple Tables

Highlighted

Fuzzy Match Join Multiple Tables

New Contributor

Hello!

I have been stuck on this issue for a few days and am at a complete standstill as to what I can do next. Here is a description of my problem.

There are four tables imported into Hive with multiple data points. Unfortunately, the only join-able key is employee names. The problem with this key is that the names are spelled differently/formatted differently across data sources. I downloaded a fuzzy match UDF, but I'm having a hard time LEFT OUTER JOINING the tables onto each other using the fuzzy match UDF because the results are non-equality.

I'm looking for suggestions on how to join these tables onto each other in HIVE without inner joining. I would do this on the server side sql, but these tables are from different data sources/databases.

Thanks for the help!

3 REPLIES 3
Highlighted

Re: Fuzzy Match Join Multiple Tables

Explorer

typical pattern is: select * from table_1 left join table_2 on (lcase(column_x) = lcase(column_y))

can you produce a canonical format on both sides (like lcase above) and join on that?

Re: Fuzzy Match Join Multiple Tables

New Contributor

Thanks for the reply! Unfortunately, that will not work in this case because the names are spelled differently. Example:

Table 1 Name = Tyler LeMartin

Table 2 Name = Tyler Le-Martin

Table 3 Name = Tyler Martin

I'd need to join the tables on to each based on the similiarity being >= 95%

Highlighted

Re: Fuzzy Match Join Multiple Tables

Explorer

I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can do it like this probably:

select ... from table_1, table_2 where fuzzy_match(column_x, column_y) (inner join branch)

union all

select ... from table_1 where not exists (select 1 from table_2 where fuzzy_match(column_x, column_y)) s1; (left outer branch)

With this: https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single node cross products, but it will still be an expensive operation.