Using Spark, how can I join three pair RDDs?
I'm able to:
So, to get an RDD joining the 3 files, I have to perform 2 joins.
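For reference, the two-join approach could look roughly like the sketch below. It assumes Spark 1.3 or later (where the pair-RDD implicits need no extra import) and a local master; the object name `ChainedJoinSketch`, the helper `joinThree`, and the sample data are made up for illustration and are not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ChainedJoinSketch {
  // Hypothetical helper: inner-join three pair RDDs on their key with two chained joins.
  // a.join(b).join(c) yields values shaped ((x, y), z), so mapValues flattens them.
  def joinThree(a: RDD[(Int, String)],
                b: RDD[(Int, String)],
                c: RDD[(Int, String)]): RDD[(Int, (String, String, String))] =
    a.join(b).join(c).mapValues { case ((x, y), z) => (x, y, z) }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally, just for experimentation.
    val conf = new SparkConf().setAppName("chained-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for the three files.
      val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
      val b = sc.parallelize(Seq(1 -> "b1", 2 -> "b2"))
      val c = sc.parallelize(Seq(1 -> "c1", 3 -> "c3"))
      joinThree(a, b, c).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this sample data only key 1 appears in all three RDDs, so it is the only key that survives both inner joins.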
How about using cogroup?
Spark's cogroup can work on 3 RDDs at once.
I checked the Scala cogroup signature, and it says it can combine two other RDDs, other1 and other2, with this one at the same time:
For each key k in this or other1 or other2, return a resulting RDD that contains a tuple with the list of values for that key in this, other1 and other2.
I cannot try this in Spark because I do not have it set up at the office; otherwise I would love to try it.
After cogroup, you can apply mapValues and merge the three sequences.
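One way the cogroup suggestion could be turned into an inner join is sketched below. It uses `flatMapValues` rather than `mapValues`, so that each combination of values becomes its own record; that choice, the object name `CogroupJoinSketch`, the helper `cogroupThree`, and the sample data are assumptions for illustration. It also assumes a Spark version where `cogroup` returns `Iterable` values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CogroupJoinSketch {
  // Hypothetical helper: one cogroup over three pair RDDs, then the cross
  // product of the three value groups per key. A key missing from any RDD
  // has an empty group there, so it produces no output (inner-join behaviour).
  def cogroupThree(a: RDD[(Int, String)],
                   b: RDD[(Int, String)],
                   c: RDD[(Int, String)]): RDD[(Int, (String, String, String))] =
    a.cogroup(b, c).flatMapValues { case (xs, ys, zs) =>
      for (x <- xs; y <- ys; z <- zs) yield (x, y, z)
    }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally, just for experimentation.
    val conf = new SparkConf().setAppName("cogroup-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data.
      val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
      val b = sc.parallelize(Seq(1 -> "b1", 2 -> "b2"))
      val c = sc.parallelize(Seq(1 -> "c1", 3 -> "c3"))
      cogroupThree(a, b, c).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

On this sample data the result should match what two chained inner joins give, since only key 1 has a value in all three groups.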
Thanks for your reply, this is a very interesting functionality you have pointed out!
I will have a look at this and check if it also works for complex joins (like outer joins).
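For the outer-join case, a possible sketch is to treat an empty group as a single missing value before combining. This is only one way to get full-outer-style results from cogroup; the object name `CogroupOuterSketch`, the helpers `orNone` and `fullOuterThree`, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CogroupOuterSketch {
  // Hypothetical helper: an empty group becomes a single None, so a key that
  // is absent from one RDD still produces a row with a missing value there.
  def orNone[T](it: Iterable[T]): Iterable[Option[T]] =
    if (it.isEmpty) Seq(None) else it.map(v => Some(v): Option[T])

  // Full-outer-style combination of three pair RDDs built on a single cogroup.
  def fullOuterThree(a: RDD[(Int, String)],
                     b: RDD[(Int, String)],
                     c: RDD[(Int, String)])
      : RDD[(Int, (Option[String], Option[String], Option[String]))] =
    a.cogroup(b, c).flatMapValues { case (xs, ys, zs) =>
      for (x <- orNone(xs); y <- orNone(ys); z <- orNone(zs)) yield (x, y, z)
    }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally, just for experimentation.
    val conf = new SparkConf().setAppName("cogroup-outer-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: key 2 is missing from c, key 3 from a and b.
      val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
      val b = sc.parallelize(Seq(1 -> "b1", 2 -> "b2"))
      val c = sc.parallelize(Seq(1 -> "c1", 3 -> "c3"))
      fullOuterThree(a, b, c).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Left or right outer variants could follow the same pattern by applying `orNone` only to the groups that are allowed to be missing.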