Created 06-15-2017 04:14 PM
Hi all,
We have multiple tables that need to be combined into a single table using left joins. There are many one to many relationships. Naturally - after the first join the subsequent join will produce duplicate rows. The end result is a massive table with mostly duplicates. I understand these can be removed easily in 2 ways. 1. doing a insert overwrite and selecting distinct rows. 2. group by on all final columns.
Which of these is the optimal option?
Is there a pattern in Hive that will allow adding in additional tables and removing duplicates per table (instead of all in the end)...
Thanks in advance.
Created 06-16-2017 12:14 PM
Hi @Sonny Heer,
So what I understand from your query is you've got multiple tables say A,B,C,D,etc and your selecting a query joining on A left join B left join C , etc and there are Multiple entries in table B,C,D for the Key matching with A.
If this is the case, What I would suggest you is to use Windowing function.
Select A.a,B.b,C,c from A left join (Select * from ( Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B) where row_num=1) B on A.key = B.key and so on..
Try this out and let me know if it was helpful.
Cheers,
Sagar
Created 06-20-2017 06:25 AM
I think you can do that.
Instead of this:
Select B.b,B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1
You can use
Select B.b,B.key,ROW_NUMBER() OVER (count by key) AS row_num from B)where row_num=1
Though I am not very sure, but Hive documentation says you can use standard aggregate in Over function. Check the link below:
Cheers,
Sagar
Created 06-20-2017 02:45 PM
That doesn't seem to work. based on doc it shows below, but that also requires a group by. I might be missing something.
Select COUNT(B.b),B.key,ROW_NUMBER() OVER (partition by key) AS row_num from B)where row_num=1