Spark DataFrame - difference between sort and orderBy functions?

I would like to understand whether there is any functional difference between the sort and orderBy functions on a DataFrame.

Does either one perform a total-order sort across all partitions, or do they sort the data within each partition with no guarantee of a total order?

Based on the answer, I would like to know when to use each function.

1 ACCEPTED SOLUTION

@Dinesh Chitlangia

orderBy is just an alias for the sort function and gives the same result.

The following is from the Spark source (the Scaladoc for orderBy in Dataset.scala):

/**
   * Returns a new Dataset sorted by the given expressions.
   * This is an alias of the `sort` function.
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)

Both will order across partitions. To get an understanding of how Spark does a sort take a look at the explanation in the link below:

http://stackoverflow.com/questions/32887595/how-does-spark-achieve-sort-order

If you would like to sort within each partition instead, you can use repartitionAndSortWithinPartitions on a pair RDD, or sortWithinPartitions on a DataFrame.

https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/OrderedRDDFunctions.html#repartiti...
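
To make the point above concrete, here is a minimal sketch, assuming a local Spark 2.x or later session. The object name, the column name n, the sample values, and the partition count are all made up for illustration. It checks that sort and orderBy return the same totally ordered result, and uses the DataFrame method sortWithinPartitions to show the per-partition alternative.

```scala
import org.apache.spark.sql.SparkSession

object SortVsOrderBy {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("sort-vs-orderBy")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, spread over 4 partitions.
    val df = Seq(5, 3, 9, 1, 7, 2, 8, 4, 6, 0).toDF("n").repartition(4)

    // sort and orderBy: both produce a total order across all partitions.
    val viaSort = df.sort("n").as[Int].collect().toList
    val viaOrderBy = df.orderBy("n").as[Int].collect().toList
    assert(viaSort == viaOrderBy)
    assert(viaSort == (0 to 9).toList)

    // sortWithinPartitions: every partition is sorted on its own,
    // but rows are not ordered across partitions.
    val perPartition = df.sortWithinPartitions("n").as[Int].rdd.glom().collect()
    perPartition.foreach(p => assert(p.toList == p.toList.sorted))

    spark.stop()
  }
}
```

glom() is used here only to inspect the contents of each partition; it is not needed in normal code.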

5 REPLIES 5

@Dinesh Chitlangia

sort and orderBy are the same in Spark; they work in exactly the same way. In Hive, however, SORT BY and ORDER BY behave quite differently. If you want to know the differences in Hive, refer to the link below:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


sort and orderBy are the same in Spark.

orderBy is an alias for sort in the Dataset API:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.sca...


They are actually not the same, at least as SQL clauses. (The DataFrame methods sort and orderBy are still aliases of each other.)

SORT BY sorts data within each partition, while ORDER BY is a global sort.

SORT BY corresponds to the sortWithinPartitions() function, while ORDER BY corresponds to sort().

Both of these functions call sortInternal(), but with a different global flag:

def sortWithinPartitions ...
  sortInternal(global = false, sortExprs)

def sort ...
  sortInternal(global = true, sortExprs)
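
The difference between the two SQL clauses can be observed with a short sketch like the one below. It assumes a local Spark session, and the object name, the view name numbers, the column n, and the sample values are hypothetical. It only checks observable behavior (ordering of the results), not which internal functions are called.

```scala
import org.apache.spark.sql.SparkSession

object SqlSortByVsOrderBy {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("sql-sortby-vs-orderby")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data in 4 partitions, exposed as a temp view.
    Seq(5, 3, 9, 1, 7, 2, 8, 4, 6, 0).toDF("n")
      .repartition(4)
      .createOrReplaceTempView("numbers")

    // ORDER BY: global sort, so the collected rows are totally ordered.
    val ordered = spark.sql("SELECT n FROM numbers ORDER BY n")
    assert(ordered.as[Int].collect().toList == (0 to 9).toList)

    // SORT BY: each partition is sorted independently.
    val sortedBy = spark.sql("SELECT n FROM numbers SORT BY n")
    val partitions = sortedBy.as[Int].rdd.glom().collect()
    partitions.foreach(p => assert(p.toList == p.toList.sorted))

    // Print both physical plans to compare them by eye.
    ordered.explain()
    sortedBy.explain()

    spark.stop()
  }
}
```

Comparing the two explain() outputs is a quick way to see whether a query shuffles data to achieve a global order or sorts each partition in place.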