
Dataframe Insert into ORC table is slow compared to Parquet table

New Contributor

We have a 6-node cluster where we read CSV files into a DataFrame and save them into an ORC table; this was taking longer than expected. We initially thought there was a problem with the CSV library we are using (the spark-csv datasource by Databricks). To validate this, we changed only the output format to Parquet and saw nearly a 10x performance difference. Below is the action where we insert into the Hive table.

DataFrame dataFrame = hiveContext.read().format("com.databricks.spark.csv")
        .schema(structType)
        .option("dateFormat", "yyyyMMdd HH:mm:ss")
        .option("delimiter", "\t")
        .load(paths);

dataFrame.write().mode(SaveMode.Append)
        .partitionBy(Constants.HIVE_PARTITION_COL)
        .insertInto(properties.HiveTableName);

We were able to load 15 million records per minute into the Parquet table; when we changed the storage format to ORC, throughput dropped drastically to 2 million records per minute.
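
To isolate the format cost, the comparison boiled down to writing the same DataFrame twice and changing only the output format; a minimal sketch (the output paths here are placeholders):

// Same DataFrame, two sinks: only the output format changes.
dataFrame.write().format("parquet").save("/tmp/bench/parquet"); // placeholder path
dataFrame.write().format("orc").save("/tmp/bench/orc");         // placeholder path; ORC requires a HiveContext on Spark 1.x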

HDP Version 2.4.3

We are in a fix, as we chose ORC as the storage format for our platform. Any help in figuring out the problem here is appreciated.

Thanks in advance.

1 ACCEPTED SOLUTION

Super Collaborator

Apache Spark has traditionally worked sub-optimally with ORC because ORC used to live inside Apache Hive, and Apache Spark depends on a very old Hive release, 1.2.1, from mid-2015. We are working out how best to update Apache Spark's ORC support, either by upgrading Spark's dependency to the latest Apache Hive or by pulling in ORC from the new Apache ORC project.
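
For what it's worth, newer Spark releases let you bypass the Hive 1.2.1 code path entirely; a minimal sketch, assuming Spark 2.3+ where the spark.sql.orc.impl option exists (it is not available on the Spark 1.6 that ships with HDP 2.4.3):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("native-orc-example")
        .enableHiveSupport()
        // Use the Apache ORC ("native") reader/writer instead of the legacy Hive 1.2.1 one.
        .config("spark.sql.orc.impl", "native")
        // Vectorized reads apply only to the native implementation.
        .config("spark.sql.orc.enableVectorizedReader", "true")
        .getOrCreate();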


3 REPLIES


Rising Star

@Benakaraj KS

Upgrade to HDP 2.6.3 and set spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.enabled=true.

You'll get a 3x speedup 🙂
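
A minimal sketch of wiring in those two flags (the property names are taken verbatim from the reply above; spark.sql.orc.enabled is, to my knowledge, specific to the Spark build that ships with HDP, so treat it as an HDP 2.6.3 assumption):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("orc-insert")
        .enableHiveSupport()
        // Route reads/writes of metastore ORC tables through Spark's ORC datasource.
        .config("spark.sql.hive.convertMetastoreOrc", "true")
        // Assumed HDP-specific switch for the backported ORC improvements.
        .config("spark.sql.orc.enabled", "true")
        .getOrCreate();

The same flags can also be passed at submit time: spark-submit --conf spark.sql.hive.convertMetastoreOrc=true --conf spark.sql.orc.enabled=true.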