Spark Dataframe sorting not working on cluster

We are having trouble sorting DataFrames in Spark 1.6. We call df.orderBy(userColumn, rankColumn). The data is sorted correctly when the DataFrame's data sits in a single partition. As soon as the data is spread over more than one partition, the sort stops working on the cluster. We also tried the DISTRIBUTE BY and SORT BY approach described in this post: http://saurzcode.in/2015/01/hive-sort-vs-order-vs-distribute-vs-cluster/. That did not work either. Please suggest a fix.
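For anyone comparing the two approaches mentioned above: in Hive, DISTRIBUTE BY plus SORT BY only orders rows inside each reducer, so the combined result is not expected to be globally sorted. The sketch below is plain Python, not Spark or Hive code, and only simulates that behaviour on the sample rows from this thread. The helper names `partition_of` and `distribute_and_sort` are hypothetical, and the modulo rule is an assumed stand-in for a real hash partitioner. It does not explain the orderBy result itself.

```python
from collections import defaultdict

# Rows copied from the df.show() sample in this thread: (userColumn, rankColumn).
rows = [("U5", 5), ("U6", 1), ("U1", 1), ("U1", 2),
        ("U5", 4), ("U5", 2), ("U2", 4), ("U3", 1)]


def partition_of(user, num_partitions):
    # Hypothetical stand-in for a hash partitioner: uses the digits of the
    # user id so the assignment is deterministic across runs.
    return int(user[1:]) % num_partitions


def distribute_and_sort(rows, num_partitions):
    """Send every user to one partition, then sort each partition on its own."""
    partitions = defaultdict(list)
    for user, rank in rows:
        partitions[partition_of(user, num_partitions)].append((user, rank))
    return [sorted(partitions[p]) for p in range(num_partitions)]


parts = distribute_and_sort(rows, 2)

# Reading the partitions back one after another, as a consumer would.
concatenated = [row for part in parts for row in part]

# A total order over the whole data set, which is what orderBy is meant to give.
globally_sorted = sorted(rows)

for index, part in enumerate(parts):
    print("partition", index, part)
print("same as global sort:", concatenated == globally_sorted)
```

With a single partition the two results coincide, which matches the observation that everything looks fine when the data is in one partition.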

1 REPLY

Input: df.show()

userColumn  rankColumn
U5          5
U6          1
U1          1
U1          2
U5          4
U5          2
U2          4
U3          1

df = df.orderBy(userColumn, rankColumn)

df.show()

Expected Output:

userColumn  rankColumn
U1          1
U1          2
U2          4
U3          1
U5          2
U5          4
U5          5
U6          1

Actual output (if Spark puts all data in one partition):

userColumn  rankColumn
U1          1
U1          2
U2          4
U3          1
U5          2
U5          4
U5          5
U6          1

Actual output (if Spark does not put all data in one partition):

U1          2
U1          1
U3          1
U2          4
U6          1
U5          2
U5          4
U5          5
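To make the problem in the multi-partition output concrete, here is a small plain-Python check (not Spark code) that lists the adjacent rows that break the (userColumn, rankColumn) ordering. The function name `out_of_order_pairs` is hypothetical, and the rows are typed in by hand from the listing above.

```python
# Rows copied from the multi-partition "Actual output" listing above.
actual = [("U1", 2), ("U1", 1), ("U3", 1), ("U2", 4),
          ("U6", 1), ("U5", 2), ("U5", 4), ("U5", 5)]


def out_of_order_pairs(rows):
    """Return adjacent row pairs where the first row sorts after the second."""
    return [(a, b) for a, b in zip(rows, rows[1:]) if a > b]


violations = out_of_order_pairs(actual)
for first, second in violations:
    print(first, "appears before", second)
```

On this sample the check flags three adjacent pairs, while the expected output listed earlier has none.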

Please let me know if you need any other details.