Spark DataFrame - difference between sort and orderBy functions?

I would like to understand whether there is any functional difference between the sort and orderBy functions on a DataFrame.

Do they perform a total-order sort across all partitions, or do they sort the data only within each partition, with no guarantee of total order?

Based on the answer, I would like to know when to use each function.

ACCEPTED SOLUTION

Re: Spark DataFrame - difference between sort and orderBy functions?

@Dinesh Chitlangia

orderBy is just an alias for the sort function and gives the same result.

The following is from the Spark source (the Scaladoc on Dataset.orderBy):

/**
   * Returns a new Dataset sorted by the given expressions.
   * This is an alias of the `sort` function.
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)

Both produce a total ordering across all partitions. To understand how Spark performs the sort, see the explanation at the link below:

http://stackoverflow.com/questions/32887595/how-does-spark-achieve-sort-order
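
As a quick check of the alias claim, the sketch below runs sort and orderBy on the same small DataFrame and compares the results. It assumes Spark 2.x or later with a local master; the object name SortVsOrderBy, the helper sortedIds, and the column name "id" are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.SparkSession

object SortVsOrderBy {
  // Hypothetical helper: returns the ids produced by sort and by orderBy
  // on the same input, collected to the driver in result order.
  def sortedIds(spark: SparkSession): (Seq[Int], Seq[Int]) = {
    import spark.implicits._
    // Spread a few unsorted rows over three partitions.
    val df = Seq(3, 1, 5, 2, 4, 6).toDF("id").repartition(3)
    val viaSort = df.sort("id").collect().map(_.getInt(0)).toSeq
    val viaOrderBy = df.orderBy("id").collect().map(_.getInt(0)).toSeq
    (viaSort, viaOrderBy)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-vs-orderBy")
      .getOrCreate()
    try {
      val (viaSort, viaOrderBy) = sortedIds(spark)
      println(s"sort:    ${viaSort.mkString(", ")}")
      println(s"orderBy: ${viaOrderBy.mkString(", ")}")
      // Both calls go through the same code path, so the results should match,
      // and the collected rows should be in total order across partitions.
      assert(viaSort == viaOrderBy)
      assert(viaSort == viaSort.sorted)
    } finally {
      spark.stop()
    }
  }
}
```

Calling explain() on each of the two sorted DataFrames should also show the same physical plan, which is another way to confirm that the two methods are interchangeable.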

If you only need the data sorted within each partition, you can use repartitionAndSortWithinPartitions on a pair RDD, or sortWithinPartitions on a DataFrame.

https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/OrderedRDDFunctions.html#repartiti...
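
A minimal sketch of the per-partition approach on a pair RDD follows. It assumes a local master and uses a HashPartitioner with two partitions purely for illustration; the object name, the helper keysPerPartition, and the sample data are hypothetical. glom() is used only to make the partition boundaries visible.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsSketch {
  // Hypothetical helper: returns the keys held by each partition after
  // repartitionAndSortWithinPartitions, one inner Seq per partition.
  def keysPerPartition(spark: SparkSession): Seq[Seq[Int]] = {
    val pairs = spark.sparkContext.parallelize(
      Seq(5 -> "e", 2 -> "b", 4 -> "d", 1 -> "a", 6 -> "f", 3 -> "c"))
    pairs
      // Shuffle by key into 2 partitions and sort by key inside each one.
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .glom()                      // one array per partition
      .collect()
      .map(_.map(_._1).toSeq)      // keep only the keys
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions")
      .getOrCreate()
    try {
      val parts = keysPerPartition(spark)
      parts.zipWithIndex.foreach { case (keys, i) =>
        println(s"partition $i: ${keys.mkString(", ")}")
      }
      // Each partition is sorted on its own; there is no ordering
      // guarantee between one partition and the next.
      parts.foreach(keys => assert(keys == keys.sorted))
    } finally {
      spark.stop()
    }
  }
}
```

Because the sort happens as part of the shuffle, this is generally cheaper than a full sort when a total order is not required.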


Re: Spark DataFrame - difference between sort and orderBy functions?

@Dinesh Chitlangia

sort and orderBy are the same in Spark; both work the same way. In Hive, however, SORT BY and ORDER BY behave quite differently. For the differences in Hive, refer to the link below:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


Re: Spark DataFrame - difference between sort and orderBy functions?

sort and orderBy are the same in Spark.

orderBy is an alias for sort in the Dataset API:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.sca...


Re: Spark DataFrame - difference between sort and orderBy functions?

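The SQL clauses SORT BY and ORDER BY are actually not the same, even though the DataFrame methods sort and orderBy are.

SORT BY sorts data within each partition, while ORDER BY is a global sort.

SORT BY corresponds to the sortWithinPartitions() function, while ORDER BY corresponds to sort().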

Both of these functions call sortInternal(), but with a different global flag:

def sortWithinPartitions ...
  sortInternal(global = false, sortExprs)

def sort ...
  sortInternal(global = true, sortExprs)
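
To see the distinction in practice, here is a rough sketch that runs both SQL clauses against the same temporary view. It assumes Spark 2.x or later with a local master; the view name "t", the column "id", the object name, and the helper run are made up for illustration. The DataFrame equivalents would be df.sort("id") and df.sortWithinPartitions("id").

```scala
import org.apache.spark.sql.SparkSession

object SortByVsOrderBy {
  // Hypothetical helper: returns (rows from ORDER BY in collected order,
  // rows from SORT BY grouped by partition).
  def run(spark: SparkSession): (Seq[Int], Seq[Seq[Int]]) = {
    import spark.implicits._
    // Unsorted rows spread over three partitions, exposed as view "t".
    Seq(3, 1, 5, 2, 4, 6).toDF("id").repartition(3).createOrReplaceTempView("t")

    val global = spark.sql("SELECT id FROM t ORDER BY id") // total order
    val local = spark.sql("SELECT id FROM t SORT BY id")   // per-partition order

    // Print the physical plans so the two sort operators can be compared.
    global.explain()
    local.explain()

    val ordered = global.collect().map(_.getInt(0)).toSeq
    val perPartition =
      local.rdd.glom().collect().map(_.map(_.getInt(0)).toSeq).toSeq
    (ordered, perPartition)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sortby-vs-orderby")
      .getOrCreate()
    try {
      val (ordered, perPartition) = run(spark)
      // ORDER BY: the whole result is sorted.
      assert(ordered == ordered.sorted)
      // SORT BY: only each partition is sorted on its own.
      perPartition.foreach(p => assert(p == p.sorted))
    } finally {
      spark.stop()
    }
  }
}
```

If a total order is not needed downstream, SORT BY (or sortWithinPartitions) avoids the range-partitioning shuffle that a global sort requires.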
