Support Questions

Find answers, ask questions, and share your expertise

Handling Multiple joins creating duplicates

avatar
Contributor

Hi all,

We have multiple tables that need to be combined into a single table using left joins. There are many one to many relationships. Naturally - after the first join the subsequent join will produce duplicate rows. The end result is a massive table with mostly duplicates. I understand these can be removed easily in 2 ways. 1. doing a insert overwrite and selecting distinct rows. 2. group by on all final columns.

Which of these is the optimal option?

Is there a pattern in Hive that will allow adding in additional tables and removing duplicates per table (instead of all in the end)...

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Rising Star

Hi @Sonny Heer,

So what I understand from your query is you've got multiple tables say A,B,C,D,etc and your selecting a query joining on A left join B left join C , etc and there are Multiple entries in table B,C,D for the Key matching with A.

If this is the case, What I would suggest you is to use Windowing function.

Select A.a,B.b,C,c
from A left join
(Select * from 
( Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)
where row_num=1) B
on A.key = B.key
and so on..

Try this out and let me know if it was helpful.

Cheers,

Sagar

View solution in original post

11 REPLIES 11

avatar
Rising Star

Hi @Sonny Heer,

So what I understand from your query is you've got multiple tables say A,B,C,D,etc and your selecting a query joining on A left join B left join C , etc and there are Multiple entries in table B,C,D for the Key matching with A.

If this is the case, What I would suggest you is to use Windowing function.

Select A.a,B.b,C,c
from A left join
(Select * from 
( Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)
where row_num=1) B
on A.key = B.key
and so on..

Try this out and let me know if it was helpful.

Cheers,

Sagar

avatar
Contributor

Thanks for the response. I decided to do inner select inside the left join to aggregate the records as needed. Its not clear how the windowing function helps yet (it will help if you can descirbe what you get at each stage using that function) - I will try it out and get back here.

avatar
Contributor

@Sagar Morakhia

I realize what I'm after is a row_num on A table. I actually want all from the join tables. but as more joins are added the main join key (e.g. A.key) keeps increasing. whats the format for joining to using unique A.key to others??

I'll try out some variations. Thanks for your help on this..

avatar
Rising Star

@Sonny Heer

Apologies but I din get your question. Can you provide me with an example?

avatar
Contributor

@Sagar Morakhia

Thanks for the response. Although that gives me the first matching row from B, what if I want other columns from B. for example B.b column is duplicated per A/B.key. Although with row_num = 1 - I only get the first one right? whats diff from just doing a GROUP BY on key? Is windowing faster?

So in this regard windowing is helpful because I can control matching from the outer table (A) to all other child tables? e.g. the C join won't be based on the big table of A&B but instead on A.key. But this is still achievable with subquery and group by? correct me if i'm wrong...

avatar
Rising Star
@Sonny Heer

So, we definitely need to use Sub-query in any case ( Group by or Windowing). And yes, Windowing is much faster than Group By, For the simple logic, Say you have 1 million rows, Group by will 1st Sort the data and then Group by the Key, whereas Windowing will just Sort and give you the 1st entry.

However, If your dataset is not large enough, you can live with Group by. It will hardly make any difference.

Can you please try and run both the queries (Windowing & Group by) and check a couple of things:

1. No. of Map task /Reduce tasks in both the queries.

2. If the Time Difference between 2 queries is more than 2 Mins, or it's almost the same.

avatar
Contributor

@Sagar Morakhia

Yeah that makes sense. The difference being if I want to count some other column in B (e.g. B.b). thats where group by is the only option. Is there a way to get the highest count via row_num vs doing where row_num = 1? most of the time you want to join to a many table, but do something with the repeating column. In our case we want to count by key..

avatar

I dont think window function works better when the data is huge. when you use windowing function all the data gets accumulated in a single reducer which will end up in performance issue. I would suggest to go with distinct/ group by function which you have mentioned to avoid such issues.

avatar
Rising Star

@Bala Vignesh N V: I have never heard this thing. Windowing Function works only on 1 reducer... Thats only for Order by if I am not wrong.. Can you share any documentation If you have for the same.