
Dataframe Insert into ORC table is slow compared to Parquet table

New Contributor

We have a 6-node cluster where we read CSV files into a DataFrame and save them into an ORC table; this was taking longer than expected. We initially thought there was a problem with the CSV library we are using (the spark-csv datasource by Databricks). To validate this, we changed only the output format to Parquet and saw nearly a 10x performance difference. Below is the action where we insert into the Hive table.

DataFrame dataFrame = hiveContext.read().format("com.databricks.spark.csv")
        .schema(structType)
        .option("dateFormat", "yyyyMMdd HH:mm:ss")
        .option("delimiter", "\t")
        .load(paths);

dataFrame.write().mode(SaveMode.Append)
        .partitionBy(Constants.HIVE_PARTITION_COL)
        .insertInto(properties.HiveTableName);

We were able to load 15 million records per minute into the Parquet table; when we changed the storage format to ORC, throughput dropped drastically to 2 million records per minute.
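
To isolate the format cost, the comparison boiled down to writing the same DataFrame twice and changing only the output format; a minimal sketch (the output paths here are placeholders):

// Same DataFrame, two sinks: only the output format changes.
dataFrame.write().format("parquet").save("/tmp/bench/parquet"); // placeholder path
dataFrame.write().format("orc").save("/tmp/bench/orc");         // placeholder path; ORC requires a HiveContext on Spark 1.x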

HDP Version 2.4.3

We are in a fix, as we chose ORC as the storage format for our platform. Any help in figuring out the problem here is appreciated.

Thanks in advance.

1 ACCEPTED SOLUTION

Super Collaborator

Apache Spark has traditionally worked sub-optimally with ORC because ORC used to live inside Apache Hive, and Apache Spark depends on a very old Hive release, 1.2.1, from mid-2015. We are working out how best to update Apache Spark's ORC support, either by upgrading Spark's dependency to the latest Apache Hive or by pulling in ORC from the new Apache ORC project.
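
For what it's worth, newer Spark releases let you bypass the Hive 1.2.1 code path entirely; a minimal sketch, assuming Spark 2.3+ where the spark.sql.orc.impl option exists (it is not available on the Spark 1.6 that ships with HDP 2.4.3):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("native-orc-example")
        .enableHiveSupport()
        // Use the Apache ORC ("native") reader/writer instead of the legacy Hive 1.2.1 one.
        .config("spark.sql.orc.impl", "native")
        // Vectorized reads apply only to the native implementation.
        .config("spark.sql.orc.enableVectorizedReader", "true")
        .getOrCreate();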


3 REPLIES


Rising Star

@Benakaraj KS

Upgrade to HDP 2.6.3 and set spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.enabled=true.

You'll get a 3x speedup 🙂
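
A minimal sketch of wiring in those two flags (the property names are taken verbatim from the reply above; spark.sql.orc.enabled is, to my knowledge, specific to the Spark build that ships with HDP, so treat it as an HDP 2.6.3 assumption):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("orc-insert")
        .enableHiveSupport()
        // Route reads/writes of metastore ORC tables through Spark's ORC datasource.
        .config("spark.sql.hive.convertMetastoreOrc", "true")
        // Assumed HDP-specific switch for the backported ORC improvements.
        .config("spark.sql.orc.enabled", "true")
        .getOrCreate();

The same flags can also be passed at submit time: spark-submit --conf spark.sql.hive.convertMetastoreOrc=true --conf spark.sql.orc.enabled=true.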