Created on 11-20-2017 10:11 AM
Since Apache Spark 1.4.1, Spark has supported ORC as one of its FileFormats. This article introduces how to use another, faster ORC file format with Apache Spark 2.2 in HDP 2.6.3. First, in order to show how to choose a FileFormat, Section 1.2 shows an example which writes and reads with ORCFileFormat. Section 2 shows a brief performance comparison, and Section 3 explains more use cases and ORC configurations. Section 4 summarizes ORC-related Apache Spark fixes included in HDP 2.6.3.
%spark2.spark
// Save 5 rows into an ORC file.
spark.range(5).write.format("orc").mode("overwrite").save("/tmp/orc")

// Read a DataFrame from the ORC file with the existing ORCFileFormat.
spark.read.format("orc").load("/tmp/orc").count

// Read a DataFrame from the ORC file with the new ORCFileFormat.
spark.read.format("org.apache.spark.sql.execution.datasources.orc").load("/tmp/orc").count
res4: Long = 5
res7: Long = 5
Here, I’ll show you a small and quick performance comparison to show the difference. For a TPC-DS 10TB performance comparison, please refer to the presentation at DataWorks Summit in the Reference section.
%spark2.spark
val df = spark.range(200000000).sample(true, 0.5)
df.write.format("orc").mode("overwrite").save("/tmp/orc_100m")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
%spark2.spark
// New ORC file format
spark.time(spark.read.format("org.apache.spark.sql.execution.datasources.orc").load("/tmp/orc_100m").count)

// Old ORC file format
spark.time(spark.read.format("orc").load("/tmp/orc_100m").count)
Time taken: 345 ms
res10: Long = 100000182
Time taken: 3518 ms
res12: Long = 100000182
The new ORC file format in HDP 2.6.3, org.apache.spark.sql.execution.datasources.orc, is faster than the old ORC file format. The performance difference comes from vectorization: Apache Spark has ColumnarBatch and Apache ORC has RowBatch as separate vectorized representations. By combining these two vectorization techniques, the new reader achieves the performance gain shown above. Previously, Apache Spark took advantage of its ColumnarBatch format only with Apache Parquet.
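If you want to see which reader a given query actually uses, one quick check (a sketch; the exact plan text varies by Spark build) is to print the physical plan and look at the FileScan line:

%spark2.spark
// Sketch: print the physical plans; the FileScan line names the file format
// each read path uses, so you can confirm which reader is picked.
spark.read.format("org.apache.spark.sql.execution.datasources.orc").load("/tmp/orc").explain()
spark.read.format("orc").load("/tmp/orc").explain()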
In addition, the Apache Spark community has been putting effort into SPARK-20901 (Feature parity for ORC with Parquet). Recently, with the new Apache ORC 1.4.1 (released October 16th), Spark has become more stable and faster.
%spark2.spark sql("SET spark.sql.orc.enabled=true") spark.time(spark.read.format("orc").load("/tmp/orc_100m").count) sql("SET spark.sql.orc.enabled=false") spark.time(spark.read.format("orc").load("/tmp/orc_100m").count)
res13: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 273 ms
res14: Long = 100000182
res16: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 4083 ms
res17: Long = 100000182
%spark2.spark
df.write.format("orc").mode("overwrite").saveAsTable("t1")
df.write.format("orc").mode("overwrite").saveAsTable("t2")

sql("SET spark.sql.orc.enabled=true")
spark.time(sql("SELECT COUNT(*) FROM t1").collect)

sql("SET spark.sql.orc.enabled=false")
spark.time(sql("SELECT COUNT(*) FROM t2").collect)
res21: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 404 ms
res22: Array[org.apache.spark.sql.Row] = Array([100000182])
res24: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 4333 ms
res25: Array[org.apache.spark.sql.Row] = Array([100000182])
%spark2.spark sql("DROP TABLE IF EXISTS o1") sql("CREATE TABLE o1 USING `org.apache.spark.sql.execution.datasources.orc` AS SELECT * FROM t1") sql("SET spark.sql.orc.enabled=false") spark.time(sql("SELECT COUNT(*) FROM o1").collect)
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = []
res28: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 213 ms
res29: Array[org.apache.spark.sql.Row] = Array([100000182])
%spark2.spark sql("DROP TABLE IF EXISTS h1") sql("CREATE TABLE h1 STORED AS ORC AS SELECT * FROM t1") sql("SET spark.sql.hive.convertMetastoreOrc=true") sql("SET spark.sql.orc.enabled=true") spark.time(sql("SELECT COUNT(*) FROM h1").collect)
res30: org.apache.spark.sql.DataFrame = []
res31: org.apache.spark.sql.DataFrame = []
res33: org.apache.spark.sql.DataFrame = [key: string, value: string]
res34: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 227 ms
res35: Array[org.apache.spark.sql.Row] = Array([100000182])
To utilize the new ORC file format, there are more ORC configurations which you should turn on. The following is a summary of the recommended ORC configurations in HDP 2.6.3 and above (see the sketch after this list for enabling them all at once).

- spark.sql.orc.enabled=true enables the new ORC format to read/write DataSource tables and files.
- spark.sql.hive.convertMetastoreOrc=true enables the new ORC format to read/write Hive tables.
- spark.sql.orc.filterPushdown=true enables filter pushdown for ORC formats.
- spark.sql.orc.char.enabled=true enables the new ORC format to use CHAR types to read Hive tables.
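For example, you can enable all four recommended configurations at the beginning of a session (a minimal sketch, using the same SET commands as the examples above):

%spark2.spark
// Enable the recommended ORC configurations for this session.
sql("SET spark.sql.orc.enabled=true")
sql("SET spark.sql.hive.convertMetastoreOrc=true")
sql("SET spark.sql.orc.filterPushdown=true")
sql("SET spark.sql.orc.char.enabled=true")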
ORC-related Apache Spark fixes included in HDP 2.6.3 include:

- Turning on the spark.sql.hive.convertMetastoreOrc config in HiveCompatibilitySuite.
- Support for ALTER TABLE table_name ADD COLUMNS(..) for the ORC data source.

HDP 2.6.3 provides a powerful combination of Apache Spark 2.2 and Apache ORC 1.4.1 as a technical preview.
In the Apache Spark community, SPARK-20901 (Feature parity for ORC with Parquet) is still an ongoing effort. We are looking forward to seeing more improvements in Apache Spark 2.3. Finally, to reset the configurations used in this article:
%spark2.spark sql("SET spark.sql.hive.convertMetastoreOrc=false") sql("SET spark.sql.orc.enabled=false") sql("SET spark.sql.orc.filterPushdown=false") sql("SET spark.sql.orc.char.enabled=false")
res36: org.apache.spark.sql.DataFrame = [key: string, value: string]
res37: org.apache.spark.sql.DataFrame = [key: string, value: string]
res38: org.apache.spark.sql.DataFrame = [key: string, value: string]
res39: org.apache.spark.sql.DataFrame = [key: string, value: string]
Created on 05-02-2018 06:02 PM
While creating the table with `org.apache.spark.sql.execution.datasources.orc`, we see that the SerDe properties set are haywire and not ORC. Do we have to explicitly fix that? Also, does this work seamlessly on Spark 2.2?
Created on 08-24-2018 08:28 AM
Hi @Dongjoon Hyun
How do I add this dependency in build.sbt? I'm using Spark 2.2.1, which shows the following error: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.orc.DefaultSource.
Created on 08-24-2018 08:29 PM
@Manikandan Jeyabal Are you using the official Apache Spark? The new ORC vectorized reader was added in Apache Spark 2.3.0. Please see SPARK-16060.
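For example, a minimal build.sbt sketch for building against the official Apache Spark 2.3.0 (the version values here are illustrative; adjust them to your environment):

// build.sbt (sketch; versions are illustrative)
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"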
Created on 01-15-2019 12:03 AM
Is it possible to use the new ORC format on exactly Apache Spark 2.2.0?