Grg
Explorer
Posts: 24
Registered: 08-14-2015
Accepted Solution

Joining 3 pair-RDDs


Hi,

 

Using Spark, how can I join 3 pair-RDDs?

 

I'm able to:

  • populate two RDDs (A and B)
  • identify a common key and create two pair-RDDs (A and B)
  • perform a join on this key and get a third RDD (C)
  • populate a new RDD (D)
  • identify a common key and create two pair-RDDs again (C and D)
  • perform a join on this key and get a fifth RDD (E)

So, to get an RDD joining the 3 files, I have to perform 2 joins.
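The steps above can be sketched as follows (the RDD names and contents are hypothetical; this assumes a SparkContext `sc` is already available):

```scala
import org.apache.spark.rdd.RDD

// Populate three pair-RDDs sharing a common key (illustrative data).
val a: RDD[(String, Int)]    = sc.parallelize(Seq(("k1", 1),   ("k2", 2)))
val b: RDD[(String, String)] = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))
val d: RDD[(String, Double)] = sc.parallelize(Seq(("k1", 1.5), ("k2", 2.5)))

// First join: C = A joined with B on the key.
val c: RDD[(String, (Int, String))] = a.join(b)

// Second join: E = C joined with D on the same key.
val e: RDD[(String, ((Int, String), Double))] = c.join(d)
```

Note that each `join` nests the values one level deeper, so after two joins the value is `((Int, String), Double)` rather than a flat triple.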

 

Thanks :)

Greg.

Contributor
Posts: 55
Registered: 09-17-2013

Re: Joining 3 pair-RDDs

How about using cogroup?

Spark's cogroup can work on 3 RDDs at once.

 

Below is the Scala cogroup signature I have checked; it says it can combine two other RDDs, other1 and other2, at the same time.

 

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

For each key k in this, other1 or other2, return a resulting RDD that contains a tuple with the list of values for that key in this, other1 and other2.

 

I cannot try this on Spark as I do not have a setup at the office; otherwise, I would love to try it.

 

After the cogroup, you can apply mapValues and merge the three sequences.
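A minimal sketch of this idea, with hypothetical RDD names and assuming a SparkContext `sc`. Here `flatMapValues` is used to merge the three Iterables into flat triples, which reproduces an inner join (keys missing from any input yield no output):

```scala
import org.apache.spark.rdd.RDD

// Three pair-RDDs sharing a common key (illustrative data).
val a: RDD[(String, Int)]    = sc.parallelize(Seq(("k1", 1),   ("k2", 2)))
val b: RDD[(String, String)] = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))
val d: RDD[(String, Double)] = sc.parallelize(Seq(("k1", 1.5)))

// One pass over all three RDDs.
val grouped: RDD[(String, (Iterable[Int], Iterable[String], Iterable[Double]))] =
  a.cogroup(b, d)

// Merge the three Iterables per key into flat (v, w1, w2) triples.
val joined: RDD[(String, (Int, String, Double))] =
  grouped.flatMapValues { case (vs, w1s, w2s) =>
    for (v <- vs; w1 <- w1s; w2 <- w2s) yield (v, w1, w2)
  }
```

By keeping or inspecting the empty Iterables instead of flattening them away, the same `grouped` RDD can also emulate outer-join behaviour.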

 

Thank You.

Grg
Explorer
Posts: 24
Registered: 08-14-2015

Re: Joining 3 pair-RDDs

Hello, 

 

Thanks for your reply, this is a very interesting function you have pointed out!

I will have a look at it and check whether it also works for complex joins (like outer joins).

 

Greg.

Grg
Explorer
Posts: 24
Registered: 08-14-2015

Re: Joining 3 pair-RDDs

To summarize: one may use Spark SQL (or Hive) in order to write SQL queries with complex joins. Otherwise, with the core Spark API, one must describe the execution plan oneself and therefore write each join separately.
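For completeness, a hypothetical sketch of the same three-way join expressed in Spark SQL. The table names `a`, `b` and `d` and the column names are assumptions; this supposes a `SQLContext` named `sqlContext` and that the three datasets have already been registered as temporary tables:

```scala
// Assumes: aDF.registerTempTable("a"), bDF.registerTempTable("b"),
// dDF.registerTempTable("d") were called beforehand (hypothetical DataFrames).
val result = sqlContext.sql(
  """SELECT a.key, a.v AS va, b.v AS vb, d.v AS vd
    |FROM a
    |JOIN b ON a.key = b.key
    |LEFT OUTER JOIN d ON a.key = d.key""".stripMargin)
```

Here the planner decides the join order and strategy, whereas chaining `join`/`cogroup` calls on RDDs fixes the execution plan by hand.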