Support Questions

Can DataFrame joins in Spark preserve order?

jestinm · ‎06-29-2016

I'm trying to join two DataFrames while retaining the row order of one of them. From http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order, it seems (correct me if this is inaccurate, I'm new to Spark) that joins do not preserve order: because the data sits in different partitions, rows "arrive" at the final DataFrame in no specified order. How can one join two DataFrames while preserving the order of one table?

E.g.,

col1	col2
0	b
1	a

joined with

col2	col3
a	x
b	y

on col2 should give

col1	col2	col3
0	b	y
1	a	x

ordered using the first table. I've heard that coalesce() or repartition() might help, but I'm not sure whether they apply here. Any suggestions/methods/insights are appreciated.

phargis · ‎06-29-2016

The methods you mention will not preserve sort order across a join, since a join generally shuffles the data (a broadcast join avoids the shuffle, but Spark still does not guarantee the output order). For ways to enforce sort order, you can read this post on HCC:

https://community.hortonworks.com/questions/42464/spark-dataframes-how-can-i-change-the-order-of-col...

To answer your questions about coalesce() and repartition(): both are used to change the number of partitions of an RDD or DataFrame. The repartition() method can increase or decrease the number of partitions and always performs a full shuffle, meaning data stored on one node can be moved to another. This makes it expensive for large RDDs. The coalesce() method is meant for decreasing the number of partitions and, by default, does not shuffle: it merges existing partitions instead. This makes it cheaper than repartition(), but it may produce unevenly sized partitions, since the data is not redistributed across nodes.

