Support Questions
Find answers, ask questions, and share your expertise

Joining 3 pair-RDDs

SOLVED

Explorer

Hi,

 

Using Spark, how can I join 3 pair-RDDs?

 

I'm able to:

  • populate 2 RDDs (A and B)
  • identify a common key and create 2 pair-RDDs (A and B)
  • perform a join on this key to get a 3rd RDD (C)
  • populate a new RDD (D)
  • identify a common key and again create 2 pair-RDDs (C and D)
  • perform a join on this key to get a 5th RDD (E)

So, to get an RDD joining the 3 files, I have to perform 2 joins.
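The two-step plan above can be sketched as follows. This is a minimal, hypothetical example (data and key type invented) that emulates Spark's RDD.join semantics on plain Scala collections so it runs without a cluster; with real pair-RDDs the calls would be A.join(B), then joining the result with D.

```scala
object TwoStepJoin {
  // Inner join on pair collections, mirroring Spark's RDD.join semantics:
  // for each key present on both sides, emit (key, (leftValue, rightValue)).
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    val a = Seq(1 -> "a1", 2 -> "a2")            // pair-RDD A (invented data)
    val b = Seq(1 -> "b1", 2 -> "b2", 3 -> "b3") // pair-RDD B
    val d = Seq(1 -> "d1", 2 -> "d2")            // pair-RDD D

    val c = join(a, b) // C = A join B
    val e = join(c, d) // E = C join D, values shaped ((aVal, bVal), dVal)
    println(e)
  }
}
```

Note the nested value shape after the second join: each record of E is (key, ((aVal, bVal), dVal)), so a final map is usually needed to flatten it.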

 

Thanks :)

Greg.

1 ACCEPTED SOLUTION

Re: Joining 3 pair-RDDs

Explorer
To summarize: with Spark SQL (or Hive) you can write a single SQL query containing complex, multi-way joins. With the plain Spark RDD API, on the other hand, you describe the execution plan yourself, so each join has to be written separately.
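To illustrate the Spark SQL route this answer describes, here is a hypothetical sketch (table and column names are invented, and it assumes a SparkSession `spark` and three DataFrames a, b, d sharing a column `key`): registering the datasets as temporary views lets the whole three-way join be expressed in one query, leaving the planning to the optimizer.

```scala
// Hypothetical sketch, not runnable standalone: requires a Spark session
// and DataFrames a, b, d with a common column `key` (names invented).
a.createOrReplaceTempView("A")
b.createOrReplaceTempView("B")
d.createOrReplaceTempView("D")

val e = spark.sql("""
  SELECT A.key, A.value AS aVal, B.value AS bVal, D.value AS dVal
  FROM A
  JOIN B ON A.key = B.key
  JOIN D ON A.key = D.key
""")
```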
3 REPLIES

Re: Joining 3 pair-RDDs

Explorer

How about using cogroup?

Spark's cogroup can work on 3 RDDs at once.

Below is the Scala cogroup signature I checked; it combines this RDD with two others, other1 and other2, at the same time.

 

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

For each key k in this or other1 or other2, return a resulting RDD that contains a tuple with the list of values for that key in this, other1 and other2.

 

I cannot try this on Spark at the moment, as I do not have a setup at the office; otherwise, I would love to try it.

 

After cogroup, you can apply mapValues and merge the three sequences.
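As a hedged illustration of the cogroup-then-mapValues idea, here is a sketch that emulates a three-way cogroup on plain Scala collections, so it runs without a Spark setup; with real pair-RDDs this would be self.cogroup(other1, other2) followed by mapValues. (Note that in current Spark the grouped values are Iterable rather than Seq as in the older signature quoted above.)

```scala
object CogroupSketch {
  // Emulates a three-way cogroup: for every key appearing in any input,
  // collect that key's values from all three inputs.
  def cogroup3[K, V, W1, W2](self: Seq[(K, V)],
                             other1: Seq[(K, W1)],
                             other2: Seq[(K, W2)]): Map[K, (Seq[V], Seq[W1], Seq[W2])] = {
    val keys = (self.map(_._1) ++ other1.map(_._1) ++ other2.map(_._1)).distinct
    keys.map { k =>
      k -> ((self.collect { case (`k`, v) => v },
             other1.collect { case (`k`, w) => w },
             other2.collect { case (`k`, w) => w }))
    }.toMap
  }

  // The mapValues step: merge each triple of value lists into one list
  // (here simply concatenated; the real merge rule is application-specific).
  def merge[K, A](grouped: Map[K, (Seq[A], Seq[A], Seq[A])]): Map[K, Seq[A]] =
    grouped.map { case (k, (vs, w1s, w2s)) => k -> (vs ++ w1s ++ w2s) }
}
```

Unlike join, cogroup keeps keys that are missing from some inputs (their value list is just empty), so outer-join-like behavior can be built on top of it by inspecting which of the three sequences are empty.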

 

Thank You.

Re: Joining 3 pair-RDDs

Explorer

Hello, 

 

Thanks for your reply, this is a very interesting functionality you have pointed out!

I will have a look at this and check whether it also works for complex joins (like outer joins).

 

Greg.
