Created 06-29-2016 07:35 PM
I'm currently trying to join two DataFrames together but retain the same order in one of the Dataframes. From http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order, it seems that (correct me if this is inaccurate because I'm new to Spark) joins do not preserve order because rows are joined / "arrive" at the final dataframe not in a specified order due to the data being in different partitions. How could one perform a join of two DataFrames while preserving the order of one table?
E.g.,
col1 | col2 |
0 | b |
1 | a |
joined with
col2 | col3 |
a | x |
b | y |
on col2 should give
col1 | col2 | col3 |
0 | b | y |
1 | a | x |
ordered using the first table. I've heard some things about using coalesce() or repartition(), but I'm not sure. Any suggestions/methods/insights are appreciated.
Created 06-29-2016 08:25 PM
The methods you mention will not alter sort order for a join operation, since data is always shuffled for join. For ways to enforce sort order, you can read this post on HCC:
To answer your questions about coalesce() and repartition(), these are both used to modify the # of partitions stored by the RDD. The repartition() method can increase or decrease the # of partitions, and allows shuffles across nodes, meaning data stored on one node can be moved to another. This makes it inefficient for large rdds. The coalesce() method can only be used to decrease the # of partitions, and shuffles are not allowed. This makes it more efficient than repartition, but it may result in asymmetric partitions since no data is moved across nodes.
Created 06-29-2016 08:25 PM
The methods you mention will not alter sort order for a join operation, since data is always shuffled for join. For ways to enforce sort order, you can read this post on HCC:
To answer your questions about coalesce() and repartition(), these are both used to modify the # of partitions stored by the RDD. The repartition() method can increase or decrease the # of partitions, and allows shuffles across nodes, meaning data stored on one node can be moved to another. This makes it inefficient for large rdds. The coalesce() method can only be used to decrease the # of partitions, and shuffles are not allowed. This makes it more efficient than repartition, but it may result in asymmetric partitions since no data is moved across nodes.