Support Questions

Find answers, ask questions, and share your expertise

Handling Multiple joins creating duplicates

avatar
Contributor

Hi all,

We have multiple tables that need to be combined into a single table using left joins. There are many one to many relationships. Naturally - after the first join the subsequent join will produce duplicate rows. The end result is a massive table with mostly duplicates. I understand these can be removed easily in 2 ways. 1. doing a insert overwrite and selecting distinct rows. 2. group by on all final columns.

Which of these is the optimal option?

Is there a pattern in Hive that will allow adding in additional tables and removing duplicates per table (instead of all in the end)...

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Rising Star

Hi @Sonny Heer,

So what I understand from your query is you've got multiple tables say A,B,C,D,etc and your selecting a query joining on A left join B left join C , etc and there are Multiple entries in table B,C,D for the Key matching with A.

If this is the case, What I would suggest you is to use Windowing function.

Select A.a,B.b,C,c
from A left join
(Select * from 
( Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)
where row_num=1) B
on A.key = B.key
and so on..

Try this out and let me know if it was helpful.

Cheers,

Sagar

View solution in original post

11 REPLIES 11

avatar
Rising Star

@Sonny Heer

I think you can do that.

Instead of this:

Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1

You can use

Select B.b,B.key,ROW_NUMBER() OVER (count by key) AS row_num from B)where row_num=1

Though I am not very sure, but Hive documentation says you can use standard aggregate in Over function. Check the link below:

Hive Documentation

Cheers,

Sagar

avatar
Contributor
@Sagar Morakhia

That doesn't seem to work. based on doc it shows below, but that also requires a group by. I might be missing something.

Select COUNT(B.b),B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1