
Dataframe Insert into ORC table is slow compared to Parquet table

Solved


New Contributor

We have a 6-node cluster where we read CSV files into a DataFrame and save them into an ORC table, and this was taking much longer than expected. We initially suspected the CSV library we are using (the spark-csv data source by Databricks). To test this, we changed only the output format to Parquet and saw nearly a 10x performance difference. Below is the action where we insert into the Hive table.

DataFrame dataFrame = hiveContext.read()
        .format("com.databricks.spark.csv")
        .schema(structType)
        .option("dateFormat", "yyyyMMdd HH:mm:ss")
        .option("delimiter", "\t")
        .load(paths);

dataFrame.write()
        .mode(SaveMode.Append)
        .partitionBy(Constants.HIVE_PARTITION_COL)
        .insertInto(properties.HiveTableName);

We were able to load about 15 million records per minute into the Parquet table; when we changed the storage format to ORC, throughput dropped drastically to about 2 million records per minute.
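For reference, a minimal sketch of that side-by-side test (the table names events_orc and events_parquet are hypothetical stand-ins; the write path otherwise mirrors the code above):

// Hypothetical table names; assumes two tables with identical schemas,
// one STORED AS ORC and one STORED AS PARQUET.
// Times the same partitioned append into each.
for (String table : new String[]{"events_orc", "events_parquet"}) {
    long start = System.currentTimeMillis();
    dataFrame.write()
            .mode(SaveMode.Append)
            .partitionBy(Constants.HIVE_PARTITION_COL)
            .insertInto(table);
    System.out.println(table + " append took "
            + (System.currentTimeMillis() - start) + " ms");
}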

HDP Version 2.4.3

We are in a fix, as we chose ORC as the storage format for our platform. Any help in figuring out the problem here is appreciated.

Thanks in advance

1 ACCEPTED SOLUTION


Re: Dataframe Insert into ORC table is slow compared to Parquet table

Expert Contributor

Apache Spark has traditionally worked sub-optimally with ORC because ORC used to live inside Apache Hive, and Apache Spark depends on a very old Hive release, 1.2.1, from mid-2015. We are working out how best to update Apache Spark's version of ORC, either by upgrading Apache Spark's Hive dependency to the latest Apache Hive or by depending on ORC from the new Apache ORC project.


3 REPLIES

Re: Dataframe Insert into ORC table is slow compared to Parquet table

Explorer

@Benakaraj KS

Go to HDP 2.6.3 and set spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.enabled=true.

You'll get a 3x speedup :)
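For reference, a minimal sketch of where those flags go, assuming the Spark 2.x SparkSession API that ships with HDP 2.6.3 (the app name is a hypothetical placeholder); they can equally be passed to spark-submit as --conf options:

import org.apache.spark.sql.SparkSession;

// Sketch for HDP 2.6.3 / Spark 2.x. Equivalent spark-submit form:
//   --conf spark.sql.hive.convertMetastoreOrc=true \
//   --conf spark.sql.orc.enabled=true
SparkSession spark = SparkSession.builder()
        .appName("orc-insert")                                // hypothetical app name
        .enableHiveSupport()                                  // needed to insert into Hive tables
        .config("spark.sql.hive.convertMetastoreOrc", "true") // flag suggested above
        .config("spark.sql.orc.enabled", "true")              // flag suggested above
        .getOrCreate();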
