03-06-2017 04:53 AM
We have a 6-node cluster where we are trying to read CSV files into a DataFrame and save them into an ORC table, and this is taking much longer than expected. We initially suspected a problem with the CSV library we are using (the spark-csv data source by Databricks). To validate this we changed only the output format to Parquet and saw nearly a 10x performance difference. Below is the action where we insert into the Hive table:

    DataFrame dataFrame = hiveContext.read().format("com.databricks.spark.csv")
            .schema(structType)
            .option("dateFormat", "yyyyMMdd HH:mm:ss")
            .option("delimiter", "\t")
            .load(paths);

    dataFrame.write().mode(SaveMode.Append)
            .partitionBy(Constants.HIVE_PARTITION_COL)
            .insertInto(properties.HiveTableName);

We were able to load 15 million records per minute into the Parquet table, but when we changed the storage format to ORC, throughput dropped drastically to 2 million records per minute.

HDP version: 2.4.3

We are in a fix, as we chose ORC as the storage format for our platform. Any help in figuring out the problem here is appreciated. Thanks in advance.
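For reference, the faster Parquet run changed only the sink, not the read side. A minimal sketch of that variant, assuming the same `dataFrame` and partition column as above (the table name `parquetTableName` is a hypothetical placeholder, not from our actual job):

```java
// Same CSV read as above; only the write target changes.
// Note: insertInto() writes into an existing table and ignores format(),
// so for the Parquet comparison we wrote to a Parquet-backed table instead.
// parquetTableName is a hypothetical stand-in for the test table we used.
dataFrame.write()
        .mode(SaveMode.Append)
        .partitionBy(Constants.HIVE_PARTITION_COL)
        .format("parquet")              // the only change versus the ORC run
        .saveAsTable(parquetTableName);
```

With everything else held constant, this write path sustained roughly 15 million records per minute versus about 2 million with ORC.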
Labels:
Apache Spark