Joining 3 pair-RDDs

Contributor

Hi,

 

Using Spark, how can I join 3 pair-RDDs?

 

I'm able to:

  • populate 2 RDDs (A and B)
  • identify a common key and create 2 pair-RDDs (A and B)
  • perform a join on this key and get a 3rd RDD (C)
  • populate a new RDD (D)
  • identify a common key and create 2 pair-RDDs again (C and D)
  • perform a join on this key and get a 5th RDD (E)

So, to get an RDD joining the 3 files, I have to perform 2 joins.
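
For reference, the steps above could be sketched roughly as follows in Scala. The sample data, the Int key, and the names a, b and d are made up for illustration, and the sketch assumes the same key is used for both joins (otherwise C would need to be re-keyed with a map before the second join).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell an `sc` already exists.
val conf = new SparkConf().setAppName("two-chained-joins").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data: three pair-RDDs keyed by an Int id.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2"), (3, "a3")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val d = sc.parallelize(Seq((1, "d1"), (3, "d3")))

// First join: C holds (key, (valueFromA, valueFromB)).
val c = a.join(b)

// Second join: E holds (key, ((valueFromA, valueFromB), valueFromD)).
val e = c.join(d)

// Flatten the nested tuple into (key, (a, b, d)) for readability.
val flat = e.mapValues { case ((va, vb), vd) => (va, vb, vd) }

val result = flat.collect().toList
```

With this sample data only key 1 is present in all three RDDs, so only that key survives the two inner joins.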

 

Thanks 🙂

Greg.

1 ACCEPTED SOLUTION

Contributor
To summarize: one can use Spark SQL (or Hive) to write SQL queries with complex joins. Otherwise, with the Spark RDD API, one both can and must describe the execution plan explicitly, so each join has to be written separately.
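
As an illustration of the SQL route, here is a minimal sketch assuming Spark 2.x or later (where SparkSession is available). The table names a, b, d, the column names, and the sample rows are hypothetical; the point is only that several joins, including an outer join, fit in a single query.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .appName("sql-three-way-join")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, registered as temporary views.
Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "a_val").createOrReplaceTempView("a")
Seq((1, "b1"), (2, "b2")).toDF("id", "b_val").createOrReplaceTempView("b")
Seq((1, "d1"), (3, "d3")).toDF("id", "d_val").createOrReplaceTempView("d")

// Both joins are expressed in one query; Spark SQL builds the execution plan.
val joined = spark.sql(
  """SELECT a.id, a.a_val, b.b_val, d.d_val
    |FROM a
    |JOIN b ON a.id = b.id
    |LEFT OUTER JOIN d ON a.id = d.id
    |ORDER BY a.id""".stripMargin)

// A missing match on the outer side comes back as null, wrapped here in Option.
val rows = joined.collect().toList.map { r =>
  (r.getInt(0), r.getString(1), r.getString(2), Option(r.getString(3)))
}
```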

3 REPLIES

Rising Star

How about using cogroup?

Spark's cogroup can work on 3 RDDs at once.

 

Below is the Scala cogroup signature I checked; it says that cogroup can combine two other RDDs, other1 and other2, with this one at the same time.

 

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

For each key k in this or other1 or other2, return a resulting RDD that contains a tuple with the list of values for that key in this, other1 and other2.

 

I cannot try this on Spark myself, as I do not have a setup at the office; otherwise I would love to try it.

 

After cogroup, you can apply mapValues and merge the three sequences.
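
An untested-at-the-office idea like this could look roughly as follows. This is only a sketch with made-up sample data and names (a, b, d); it uses flatMapValues instead of mapValues so that each combination of values becomes its own row, which gives inner-join behaviour. Note that in recent Spark versions the grouped values are typed as Iterable rather than Seq.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell an `sc` already exists.
val conf = new SparkConf().setAppName("cogroup-three-rdds").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data: three pair-RDDs keyed by an Int id.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2"), (3, "a3")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val d = sc.parallelize(Seq((1, "d1"), (3, "d3")))

// One cogroup over all three RDDs: (key, (valuesInA, valuesInB, valuesInD)).
val grouped = a.cogroup(b, d)

// Emit one row per combination of values. If any of the three groups is
// empty, the for-comprehension yields nothing, so the key is dropped,
// which matches inner-join semantics.
val joined = grouped.flatMapValues { case (aVals, bVals, dVals) =>
  for (va <- aVals; vb <- bVals; vd <- dVals) yield (va, vb, vd)
}

val result = joined.collect().toList
```

Compared with two chained joins, this groups all three RDDs by key in one step and leaves the decision of how to combine the groups to the function passed to flatMapValues.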

 

Thank You.

Contributor

Hello, 

 

Thanks for your reply, this is a very interesting functionality you have pointed out!

I will have a look at this and check if it also works for complex joins (like outer joins).
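
One way the outer-join case might be checked is sketched below, under the assumption that a full outer join over three RDDs is wanted. The sample data and names are hypothetical. The idea is that cogroup keeps every key found in any input, so an empty group can be replaced by a single None instead of causing the key to be dropped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell an `sc` already exists.
val conf = new SparkConf().setAppName("cogroup-outer-join").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data: three pair-RDDs keyed by an Int id.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2"), (3, "a3")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val d = sc.parallelize(Seq((1, "d1"), (3, "d3")))

val fullOuter = a.cogroup(b, d).flatMapValues { case (aVals, bVals, dVals) =>
  // An empty group becomes a single None so that the key is kept in the output.
  val aOpt: Iterable[Option[String]] = if (aVals.isEmpty) Seq(None) else aVals.map(Option(_))
  val bOpt: Iterable[Option[String]] = if (bVals.isEmpty) Seq(None) else bVals.map(Option(_))
  val dOpt: Iterable[Option[String]] = if (dVals.isEmpty) Seq(None) else dVals.map(Option(_))
  for (va <- aOpt; vb <- bOpt; vd <- dOpt) yield (va, vb, vd)
}

// Sort on the driver so the small result has a stable order.
val result = fullOuter.collect().toList.sortBy(_._1)
```

Padding only some of the groups with None would give left or right outer variants. For plain two-way joins, the pair-RDD API also offers leftOuterJoin, rightOuterJoin and fullOuterJoin directly.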

 

Greg.
