Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

join two grouped by key RDD's

avatar
Visitor

In How-to: Tune Your Apache Spark Jobs (Part 1), the following tip is given:

Avoid the flatMap-join-groupBy pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use cogroup. That avoids all the overhead associated with unpacking and repacking the groups.

Why can't I just apply a join directly on the rdd's that are already grouped by? Wouldn't that yield the same result as in cogroup?

1 ACCEPTED SOLUTION

avatar
Master Collaborator

The essential point here is that you want to avoid a shuffle, and you can avoid a shuffle if both RDDs are partitioned in the same way, because then all values for the same key are already on 1 partition in each RDD. join calls cogroup so yes both can accomplish this, as long as both RDDs have the same partitioner. This won't be true, however, if you first flatMap one of the RDDs which can't be known to retain the partitioning.

View solution in original post

1 REPLY 1

avatar
Master Collaborator

The essential point here is that you want to avoid a shuffle, and you can avoid a shuffle if both RDDs are partitioned in the same way, because then all values for the same key are already on 1 partition in each RDD. join calls cogroup so yes both can accomplish this, as long as both RDDs have the same partitioner. This won't be true, however, if you first flatMap one of the RDDs which can't be known to retain the partitioning.