03-06-2017 04:53 AM
We have a 6-node cluster where we are trying to read CSV files into a DataFrame and save them into an ORC table, and this is taking much longer than expected. We initially suspected a problem with the CSV library we are using (the spark-csv data source by Databricks). To validate this we changed only the output format to Parquet and saw nearly a 10x performance difference. Below is the action where we insert into the Hive table:

    DataFrame dataFrame = hiveContext.read().format("com.databricks.spark.csv")
            .schema(structType)
            .option("dateFormat", "yyyyMMdd HH:mm:ss")
            .option("delimiter", "\t")
            .load(paths);

    dataFrame.write().mode(SaveMode.Append)
            .partitionBy(Constants.HIVE_PARTITION_COL)
            .insertInto(properties.HiveTableName);

We were able to load 15 million records per minute into the Parquet table, but when we changed the storage format to ORC, throughput dropped drastically to 2 million records per minute.

HDP version: 2.4.3

We are in a fix, as we chose ORC as the storage format for our platform. Any help in figuring out the problem here is appreciated. Thanks in advance.
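For reference, the faster Parquet run changed only the sink, not the read side. A minimal sketch of that variant, assuming the same `dataFrame` and partition column as above (the table name `parquetTableName` is a hypothetical placeholder, not from our actual job):

```java
// Same CSV read as above; only the write target changes.
// Note: insertInto() writes into an existing table and ignores format(),
// so for the Parquet comparison we wrote to a Parquet-backed table instead.
// parquetTableName is a hypothetical stand-in for the test table we used.
dataFrame.write()
        .mode(SaveMode.Append)
        .partitionBy(Constants.HIVE_PARTITION_COL)
        .format("parquet")              // the only change versus the ORC run
        .saveAsTable(parquetTableName);
```

With everything else held constant, this write path sustained roughly 15 million records per minute versus about 2 million with ORC.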
Labels:
Apache Spark