
Merge small files in spark while writing into hive orc table

Explorer

I am reading a lot of CSV files from S3 via Spark and writing them into a Hive table as ORC. During the write it produces a lot of small files, and I need to merge them. I tried setting the property sqlContext.sql("set hive.merge.sparkfiles=true"), but it has no effect. The code is given below.

Please help.

sqlContext.sql("set hive.merge.sparkfiles=true")
sqlContext.sql("set hive.merge.smallfiles.avgsize=128000000")
sqlContext.sql("set hive.merge.size.per.task=128000000")
new_df.registerTempTable("new_df")

sqlContext.sql("INSERT OVERWRITE TABLE db.tablename " +
  "PARTITION(`week`='W_1617') " +
  "SELECT col1, col2 FROM new_df")
4 Replies


Hi @Vijayalakshmi Ekambaram

When you run a job against HDFS, the output is created as multiple splits by default so that it can be processed in parallel and work effectively. How many splits you get depends on the block size and the MapReduce split size configured in the properties. When writing the data into Hive, is it creating files smaller than the block size? If not, then that shouldn't be a problem.

After writing to the Hive table, you can use the CONCATENATE option to merge the small files.

Hope it helps!!

Use the below sample command:

ALTER TABLE tablename [PARTITION partition_spec] CONCATENATE; 
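For instance, filled in with the table and partition from the question above (names assumed from the original post), it would look like this:

ALTER TABLE db.tablename PARTITION (`week`='W_1617') CONCATENATE;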

Expert Contributor

If you want to write a single file, could you try repartitioning `new_df` before creating the temp table?

new_df.repartition(1).registerTempTable("new_df")...

Depending on your situation, you may choose a different number of partitions instead of 1.
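A minimal sketch in Scala, assuming the same `new_df` DataFrame, sqlContext, and target table from the question (the temp table name `new_df_repart` and the partition count of 8 are only illustrative):

// Repartition first so the Hive partition is written as a few larger ORC files
// instead of one small file per task. 8 is illustrative; size it so each output
// file is roughly one HDFS block.
val repartitioned = new_df.repartition(8)
repartitioned.registerTempTable("new_df_repart")

sqlContext.sql("INSERT OVERWRITE TABLE db.tablename " +
  "PARTITION(`week`='W_1617') " +
  "SELECT col1, col2 FROM new_df_repart")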

Explorer

Hi @Vijayalakshmi Ekambaram,

By default, Spark creates 200 partitions when performing aggregations, as defined by the configuration variable "spark.sql.shuffle.partitions" (default value 200). This is why you find a lot of small files in the Hive warehouse location after each insert into the Hive table using Spark.
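One way to see the effect is to lower that setting before the insert. A minimal sketch (Scala, Spark 1.x SQLContext API assumed; the value 10 is only illustrative):

// Reduce the number of shuffle partitions so fewer output files are produced.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
// or equivalently through SQL:
sqlContext.sql("set spark.sql.shuffle.partitions=10")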

You can use the "coalesce" statement below to avoid writing many files into Hive. (coalesce doesn't perform a full shuffle, whereas repartition shuffles the data fully across the network.)

new_df.coalesce(2).write.mode("append").partitionBy("week").insertInto("db.tablename")


Note: The higher default is there to increase the amount of parallelism in Spark, which is useful when you have a huge workload.

Explorer

Thank you for your replies. I am already following the methods described above. I was wondering whether there is a way to use the property

hive.merge.sparkfiles=true, which takes care of combining small files automatically.