Created on 08-14-2015 07:29 AM - edited 09-16-2022 02:37 AM
Hi,
Using Spark, how can I join 3 pair-RDDs?
I'm able to join two pair-RDDs at a time, so to get an RDD joining the 3 files I have to perform 2 joins.
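For illustration, the two-join approach looks roughly like this in spark-shell (the RDD names and contents are made up):

import org.apache.spark.rdd.RDD

// Three hypothetical pair-RDDs sharing the same key type (data is made up).
val rdd1: RDD[(Int, String)] = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rdd2: RDD[(Int, String)] = sc.parallelize(Seq((1, "x"), (2, "y")))
val rdd3: RDD[(Int, String)] = sc.parallelize(Seq((1, "u"), (3, "v")))

// Two successive inner joins; note how the value type nests.
val joined: RDD[(Int, ((String, String), String))] = rdd1.join(rdd2).join(rdd3)
joined.collect().foreach(println)   // only keys present in all three RDDs remain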
Thanks 🙂
Greg.
Created 08-26-2015 03:04 AM
How about using cogroup?
Spark's cogroup can work on 3 RDDs at once.
Below is the Scala cogroup signature I have checked; it says it can combine two other RDDs, other1 and other2, with this one at the same time.
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
For each key k in this or other1 or other2, return a resulting RDD that contains a tuple with the list of values for that key in this, other1 and other2.
I cannot try this out in Spark as I do not have a setup at the office; otherwise I would love to try it.
After cogroup, you can apply mapValues and merge the three sequences.
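For example, with three pair-RDDs rdd1, rdd2 and rdd3 keyed the same way (illustrative names), it could look roughly like this; note that recent Spark versions return Iterable rather than Seq in the cogroup result:

import org.apache.spark.rdd.RDD

// One cogroup across all three RDDs instead of two successive joins.
val grouped: RDD[(Int, (Iterable[String], Iterable[String], Iterable[String]))] =
  rdd1.cogroup(rdd2, rdd3)

// mapValues keeps the keys and merges the three groups into one Seq per key.
val merged: RDD[(Int, Seq[String])] =
  grouped.mapValues { case (v1, v2, v3) => (v1 ++ v2 ++ v3).toSeq }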
Thank You.
Created 08-26-2015 04:25 AM
Hello,
Thanks for your reply; this is a very interesting functionality you have pointed out!
I will have a look at this and check whether it also works for complex joins (like outer joins).
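From what I understand, cogroup should already behave like a full outer join, since every key from any of the three RDDs appears in the result with an empty Iterable where it is missing, and filtering on non-empty groups should emulate the other join types. A rough sketch of what I plan to test (again with illustrative names):

// Full-outer-style result: every key present in at least one RDD appears.
val fullOuterish = rdd1.cogroup(rdd2, rdd3)

// Keeping only keys that exist in rdd1 gives a left-outer-style result.
val leftOuterish = fullOuterish.filter { case (_, (v1, _, _)) => v1.nonEmpty }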
Greg.
Created 09-01-2015 12:04 AM