Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Handling Multiple joins creating duplicates

avatar
Contributor

Hi all,

We have multiple tables that need to be combined into a single table using left joins. There are many one to many relationships. Naturally - after the first join the subsequent join will produce duplicate rows. The end result is a massive table with mostly duplicates. I understand these can be removed easily in 2 ways. 1. doing a insert overwrite and selecting distinct rows. 2. group by on all final columns.

Which of these is the optimal option?

Is there a pattern in Hive that will allow adding in additional tables and removing duplicates per table (instead of all in the end)...

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Rising Star

Hi @Sonny Heer,

So what I understand from your query is you've got multiple tables say A,B,C,D,etc and your selecting a query joining on A left join B left join C , etc and there are Multiple entries in table B,C,D for the Key matching with A.

If this is the case, What I would suggest you is to use Windowing function.

Select A.a,B.b,C,c
from A left join
(Select * from 
( Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)
where row_num=1) B
on A.key = B.key
and so on..

Try this out and let me know if it was helpful.

Cheers,

Sagar

View solution in original post

11 REPLIES 11

avatar
Rising Star

@Sonny Heer

I think you can do that.

Instead of this:

Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1

You can use

Select B.b,B.key,ROW_NUMBER() OVER (count by key) AS row_num from B)where row_num=1

Though I am not very sure, but Hive documentation says you can use standard aggregate in Over function. Check the link below:

Hive Documentation

Cheers,

Sagar

avatar
Contributor
@Sagar Morakhia

That doesn't seem to work. based on doc it shows below, but that also requires a group by. I might be missing something.

Select COUNT(B.b),B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1