Can DataFrame joins in Spark preserve order?

I'm trying to join two DataFrames while retaining the row order of one of them. From http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order, it seems (correct me if this is inaccurate, I'm new to Spark) that joins do not preserve order, because the data sits in different partitions and rows "arrive" at the final DataFrame in no specified order. How can I join two DataFrames while preserving the order of one table?

E.g.,

col1  col2
0     b
1     a

joined with

col2  col3
a     x
b     y

on col2 should give

col1  col2  col3
0     b     y
1     a     x

ordered using the first table. I've heard some things about using coalesce() or repartition(), but I'm not sure. Any suggestions/methods/insights are appreciated.

1 ACCEPTED SOLUTION

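Neither coalesce() nor repartition() will make a join preserve row order. Most join strategies shuffle both inputs by the join key (a broadcast join of a small table is the main exception), and Spark makes no guarantee about the row order of a join result. For ways to enforce sort order, you can read this post on HCC: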

https://community.hortonworks.com/questions/42464/spark-dataframes-how-can-i-change-the-order-of-col...
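
One common way to get the ordering asked for in the question is to record each row's position in the first table before the join, then sort by that position afterwards. The sketch below is a plain-Python model of that idea, not Spark code; the names `ordered_join` and `_pos` are made up for illustration. In Spark, the position column could be attached with something like `monotonically_increasing_id()` or `zipWithIndex()` and the result sorted with `orderBy`, but treat that mapping as an assumption to verify against your Spark version.

```python
import random


def ordered_join(left, right, key):
    """Inner-join two lists of dict rows on `key`, keeping the row order of `left`."""
    # 1. Tag every left row with its original position (hypothetical helper column).
    tagged = [dict(row, _pos=i) for i, row in enumerate(left)]

    # 2. Join through a lookup table built from the right side.
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    joined = [{**l, **r} for l in tagged for r in lookup.get(l[key], [])]

    # The shuffle stands in for the arbitrary order a distributed join may return.
    random.shuffle(joined)

    # 3. Restore the left table's order, then drop the helper column.
    joined.sort(key=lambda row: row["_pos"])
    return [{k: v for k, v in row.items() if k != "_pos"} for row in joined]


left = [{"col1": 0, "col2": "b"}, {"col1": 1, "col2": "a"}]
right = [{"col2": "a", "col3": "x"}, {"col2": "b", "col3": "y"}]

result = ordered_join(left, right, "col2")
print(result)
```

With the two tables from the question, the rows come back in the order of the first table (0, b, y) then (1, a, x), regardless of how the intermediate join result was ordered. The extra sort is the price of the guarantee, so it is only worth paying when the order actually matters downstream.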

To answer your questions about coalesce() and repartition(): both change the number of partitions of an RDD or DataFrame, and neither is meant for controlling row order. The repartition() method can increase or decrease the number of partitions and always performs a full shuffle, meaning data stored on one node can be moved to another. This makes it expensive for large datasets. The coalesce() method is normally used to decrease the number of partitions and, by default, avoids a full shuffle by merging existing partitions. This makes it cheaper than repartition(), but it may result in unevenly sized partitions, since existing partitions are combined rather than rebalanced. On the RDD API, coalesce() also accepts a shuffle flag; with shuffle set to true it behaves like repartition().
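
To make the difference in partition sizes concrete, here is a toy model in plain Python, with partitions represented as lists of rows. It is only a sketch under simplifying assumptions: the functions `coalesce` and `repartition` below are illustrative stand-ins, and Spark's real implementations choose groupings and row placement differently (for example, taking data locality into account).

```python
def coalesce(partitions, n):
    """Merge whole neighbouring partitions into n groups; rows are never rebalanced."""
    if n >= len(partitions):
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions]
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map old partition i onto one of the n new partitions, keeping neighbours together.
        out[i * n // len(partitions)].extend(part)
    return out


def repartition(partitions, n):
    """Redistribute every row round-robin over n new partitions (models a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four input partitions of very different sizes.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = coalesce(parts, 2)
rebalanced = repartition(parts, 2)

print([len(p) for p in coalesced])   # sizes after merging whole partitions
print([len(p) for p in rebalanced])  # sizes after redistributing every row
```

In this model the coalesced partitions end up with 7 and 3 rows, while the repartitioned ones end up with 5 and 5. The even split comes from touching every row, which is why the shuffle-based approach costs more.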
