Created 03-06-2017 04:53 AM
We have a 6 node cluster where in we are trying to read csv files into a dataframe and save into ORC table, this was taking longer time than expected. We initially thought there is a problem with csv library that we are using(spark.csv datasource by databricks) to validate this we just changed the output format to parquet, and we got nearly 10 times performance difference , below is the action where we are inserting into hive table.
DataFrame dataFrame = hiveContext.read().format("com.databricks.spark.csv") .schema(structType).option("dateFormat", "yyyyMMdd HH:mm:ss").option("delimiter", "\t").load(paths)
dataFrame.write().mode(SaveMode.Append).partitionBy(Constants.HIVE_PARTITION_COL) .insertInto(properties.HiveTableName);
We were able to load 15 million records to parquet table per minute however when we changed storage format to ORC performance drastically reduced to 2 million .
HDP Version 2.4.3
We are actually in a fix as we chose ORC as storage format for our platform ,Any help in figuring the problem here is appreciated .
Thanks in Adavance
Created 03-09-2017 09:08 PM
Apache Spark has traditionally worked sub-optimally with ORC because ORC used to be inside Apache Hive and Apache Spark depends on a very old release of Hive 1.2.1 from mid-2015. We are working on figuring out how to best update Apache Spark's version of ORC either by upgrading Apache Spark's dependency to latest Apache Hive or depending for ORC from the new Apache ORC project.
Created 03-09-2017 03:02 PM
Created 03-09-2017 09:08 PM
Apache Spark has traditionally worked sub-optimally with ORC because ORC used to be inside Apache Hive and Apache Spark depends on a very old release of Hive 1.2.1 from mid-2015. We are working on figuring out how to best update Apache Spark's version of ORC either by upgrading Apache Spark's dependency to latest Apache Hive or depending for ORC from the new Apache ORC project.
Created 12-19-2017 03:38 AM
Go to 2.6.3 and set spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.enabled=true.
You'll get a 3x 🙂