Support Questions

sonnyheer · 06-15-2017

Hi all,

We have multiple tables that need to be combined into a single table using left joins. There are many one-to-many relationships, so each join after the first produces duplicate rows, and the end result is a massive table that is mostly duplicates. I understand these can be removed easily in two ways: 1. doing an INSERT OVERWRITE that selects distinct rows, or 2. grouping by all of the final columns.
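For concreteness, the two options might look roughly like the sketch below. The tables A, B and C, the shared key column, and the target table combined are placeholder names, not anything from the actual schema.

```sql
-- Option 1: overwrite the target with distinct rows only.
-- "combined", A, B, C and their columns are hypothetical names.
INSERT OVERWRITE TABLE combined
SELECT DISTINCT A.key, A.a, B.b, C.c
FROM A
LEFT JOIN B ON A.key = B.key
LEFT JOIN C ON A.key = C.key;

-- Option 2: group by every output column.
INSERT OVERWRITE TABLE combined
SELECT A.key, A.a, B.b, C.c
FROM A
LEFT JOIN B ON A.key = B.key
LEFT JOIN C ON A.key = C.key
GROUP BY A.key, A.a, B.b, C.c;
```

In both variants the joins run first and the duplicates are removed only at the very end.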

Which of these is the optimal option?

Is there a pattern in Hive that allows adding additional tables and removing duplicates per table (instead of all at the end)?

Thanks in advance.

sagar_girish · 06-16-2017

Hi @Sonny Heer,

So what I understand from your question is that you've got multiple tables, say A, B, C, D, etc., and you're running a query that joins A left join B left join C, etc., and there are multiple entries in tables B, C, D for the key matching A.

If this is the case, I would suggest using a windowing function.

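SELECT A.a, B.b, C.c
FROM A
LEFT JOIN (
  SELECT *
  FROM (
    -- number the rows within each key; the inner subquery needs an alias in Hive
    SELECT B.b, B.key, ROW_NUMBER() OVER (PARTITION BY B.key) AS row_num
    FROM B
  ) ranked
  WHERE row_num = 1  -- keep one row per key
) B
ON A.key = B.key
-- and so on for C, D, ...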
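A slightly fuller sketch of the same idea, extended to a second table, might look like this. The ORDER BY inside the window is an addition that is not in the query above: without it, which row receives row_num = 1 is arbitrary. The column updated_at is a hypothetical tiebreaker, and b_ranked and c_ranked are illustrative aliases; substitute whatever column decides which row to keep.

```sql
-- Deduplicate each right-hand table to one row per key BEFORE joining,
-- so no join multiplies the row count of A.
-- updated_at is a hypothetical column used only to choose the surviving row.
SELECT A.a, B.b, C.c
FROM A
LEFT JOIN (
  SELECT key, b
  FROM (
    SELECT key, b,
           ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) AS row_num
    FROM B
  ) b_ranked
  WHERE row_num = 1
) B ON A.key = B.key
LEFT JOIN (
  SELECT key, c
  FROM (
    SELECT key, c,
           ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) AS row_num
    FROM C
  ) c_ranked
  WHERE row_num = 1
) C ON A.key = C.key;
```

Note that this keeps a single row per key, so it fits when the extra matches are redundant. If every distinct value of b per key must survive, partitioning by key and b instead keeps one row per distinct pair.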

Try this out and let me know if it was helpful.

Cheers,

Sagar


sagar_girish · 06-20-2017

@Sonny Heer

I think you can do that.

Instead of this:

(Select B.b, B.key, ROW_NUMBER() OVER (partition by key) AS row_num from B) where row_num=1

You can use

(Select B.b, B.key, ROW_NUMBER() OVER (count by key) AS row_num from B) where row_num=1

I am not entirely sure about this, but the Hive documentation says you can use standard aggregates in the OVER clause. Check the link below:

Hive Documentation

Cheers,

Sagar

sonnyheer · 06-20-2017

@Sagar Morakhia

That doesn't seem to work. Based on the docs, it looks like it should be written as below, but that also requires a GROUP BY. I might be missing something.

(Select COUNT(B.b), B.key, ROW_NUMBER() OVER (partition by key) AS row_num from B) where row_num=1
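For reference, here is a sketch of how an aggregate is usually written as a window function in Hive: the OVER clause is attached to the aggregate itself and PARTITION BY stays inside it, so the aggregate no longer needs a GROUP BY. The names key_count and counted are illustrative aliases, not taken from the thread.

```sql
-- COUNT(*) OVER (...) adds the per-key row count to every row of B
-- without collapsing rows, so no GROUP BY is required.
SELECT b, key, key_count
FROM (
  SELECT B.b,
         B.key,
         COUNT(*) OVER (PARTITION BY B.key) AS key_count,
         ROW_NUMBER() OVER (PARTITION BY B.key) AS row_num
  FROM B
) counted
WHERE row_num = 1;
```

A bare COUNT(B.b) with no OVER clause is an ordinary aggregate, which is why it demands a GROUP BY when selected alongside non-aggregated columns.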
