Created 01-30-2018 09:48 AM
Hi,
I am currently using HDP version 2.6.3 and Spark version 2.2.0.2.6.3.0-235. When I try to use sortBy during a DataFrame write operation, I get the error below:
[Spark shell banner, version 2.2.0.2.6.3.0-235]
Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
>>> o1 = sqlContext.sql("select * from tttt limit 10000")
>>> o1.write.option("compression","zlib").mode("overwrite").format("orc").sortBy("dddd").saveAsTable("xxxx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'DataFrameWriter' object has no attribute 'sortBy'
Per the Spark documentation, sortBy has been available since Spark 2.1.0.
Thanks,
Jayadeep
Created 01-30-2018 06:55 PM
Hi, @Jayadeep Jayaraman .
As you can see in the error message, `write` returns a `DataFrameWriter`.
`sortBy` is supported on `Dataset`/`DataFrame`, so sort the data before writing. Please try the following.
scala> spark.version
res7: String = 2.2.0.2.6.3.0-235

scala> df.sort("id").write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o2")

scala> df.sort($"id".desc).write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o1")
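Since the original question used the PySpark shell, the same workaround translates roughly as follows. This is only a sketch: it reuses the table `tttt` and column `dddd` from the question, and assumes a live HDP 2.6.3 PySpark session with `sqlContext` available, so it cannot run outside a cluster.

```python
# PySpark 2.2 sketch: the DataFrameWriter here has no sortBy,
# so sort the DataFrame itself before calling write.
o1 = sqlContext.sql("select * from tttt limit 10000")
o1.sort("dddd") \
  .write.option("compression", "zlib") \
  .mode("overwrite") \
  .format("orc") \
  .saveAsTable("xxxx")
```

Note that this orders the rows going into the writer; as mentioned below, it does not guarantee sorted order when the table is read back.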
Created 01-30-2018 06:57 PM
BTW, Apache Spark doesn't guarantee that a sorted table is read back in sorted order: Spark reads the largest file in the directory first.
Created 01-31-2018 02:11 AM
Hi @Dongjoon Hyun, thanks for the clarification, but if you look at the official documentation
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html you will see that Spark 2.2.x supports the sortBy and bucketBy operations:
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
Created 01-31-2018 03:29 AM
Oh, my bad. I didn't try your command. You're right. For me, it works as follows in HDP 2.6.3 with Scala Spark.
scala> spark.version
res5: String = 2.2.0.2.6.3.0-235

scala> Seq((1,2),(3,4)).toDF("a", "b").write.option("compression","zlib").mode("overwrite").format("orc").bucketBy(10, "a").sortBy("b").saveAsTable("xx")

scala> sql("select * from xx").show
+---+---+
|  a|  b|
+---+---+
|  3|  4|
|  1|  2|
+---+---+
Created 01-31-2018 03:43 AM
For Python, I checked the code.
- PySpark 2.2 doesn't have it:
https://github.com/apache/spark/blob/branch-2.2/python/pyspark/sql/readwriter.py
- It was added in PySpark 2.3:
https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L648-L649
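Since the availability of `DataFrameWriter.sortBy` depends on the PySpark version, one defensive pattern is to feature-detect it with `hasattr` before calling it. The stub class below is purely hypothetical, standing in for a 2.2-era writer just to illustrate the check:

```python
class FakeWriter22:
    """Hypothetical stand-in for PySpark 2.2's DataFrameWriter, which lacks sortBy."""
    def bucketBy(self, n, *cols):
        return self
    def saveAsTable(self, name):
        pass

writer = FakeWriter22()
# Feature-detect before calling: on a 2.2-style writer this is False,
# so fall back to sorting the DataFrame itself (df.sort(...)) before write.
print(hasattr(writer, "sortBy"))  # -> False
print(hasattr(writer, "bucketBy"))  # -> True
```

On PySpark 2.3+ the real writer exposes `sortBy` (used together with `bucketBy`), so the first check would return True there.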
I think you found a documentation error here.
Created 02-21-2018 04:12 PM
Same problem with Spark 2.2.1.