Support Questions

Find answers, ask questions, and share your expertise

Spark 2.2.0 not supporting sortBy

Contributor

Hi ,

I am currently using HDP version 2.6.3 and Spark version 2.2.0.2.6.3.0-235. When I am trying to do a sortBy while doing a dataframe write operation I am getting the below error

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/
Using Python version 2.7.5 (default, Aug  4 2017 00:39:18)
SparkSession available as 'spark'.
>>> o1 = sqlContext.sql("select * from tttt limit 10000")
>>> o1.write.option("compression","zlib").mode("overwrite").format("orc").sortBy("dddd").saveAsTable("xxxx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'DataFrameWriter' object has no attribute 'sortBy'

Per the Spark documentation sortBy is available from Spark 2.1.0.

Thanks,

Jayadeep

6 REPLIES 6

Expert Contributor

Hi, @Jayadeep Jayaraman .

As you see in the error message, `write` returns `DataFrameWriter`.

`sortBy` is supported in `Dataset/Dataframe`. Please try the following.

scala> spark.versionres7: String = 2.2.0.2.6.3.0-235
scala> df.sort("id").write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o2") scala> df.sort($"id".desc).write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o1")

Expert Contributor

BTW, Apache Spark doesn't guarantee reading on a sorted table. Spark reads the largest file first in the directory.

Contributor

Hi @Dongjoon Hyun thanks for the clarification but if you look at the official documentation

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html then you will see that Spark 2.2.x now supports sortBy and bucketBy operations.

peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

Expert Contributor

Oh. My bad. I didn't try your command. You're right. For me, it works like the following in HDP 2.6.3 Scala Spark.

scala> spark.version
res5: String = 2.2.0.2.6.3.0-235

scala>  Seq((1,2),(3,4)).toDF("a", "b").write.option("compression","zlib").mode("overwrite").format("orc").bucketBy(10, "a").sortBy("b").saveAsTable("xx")

scala> sql("select * from xx").show
+---+---+
|  a|  b|
+---+---+
|  3|  4|
|  1|  2|
+---+---+
<br>

Expert Contributor

For python, I checked the code.

- PySpark 2.2 doesn't have that.

https://github.com/apache/spark/blob/branch-2.2/python/pyspark/sql/readwriter.py

- It's added in PySpark 2.3.

https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L648-L649

I think you found a documentation error here.

New Contributor

same problem with spark 2.2.1.