01-02-2018 08:36 AM
@Matt Andruff
Thanks for responding. The question is not about the difference between SaveMode.Append and SaveMode.Overwrite; I am concerned about the Spark job failure that happens due to multiple executions of the same task while using repartition/coalesce.
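
[Editor's note, not part of the original reply: duplicate attempts of the same task are often caused by speculative execution, which re-launches slow tasks and can leave duplicate output behind. A minimal sketch of disabling it when building the session; the object and app names are placeholders, not from the thread:

// Hypothetical sketch: turn off speculative execution so a slow task is not
// launched a second time (spark.speculation defaults to false, set here explicitly).
import org.apache.spark.sql.SparkSession

object NoSpeculation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("no-speculation")            // placeholder app name
      .config("spark.speculation", "false") // do not launch duplicate task attempts
      .enableHiveSupport()
      .getOrCreate()
  }
}
]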
12-29-2017 06:40 AM
I am reading CSV files from S3 and writing them into a Hive table as ORC. The write produces a lot of small files, which I need to merge. I have the following properties set:

spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("SET hive.merge.smallfiles.avgsize = 128000000")
spark.sql("SET hive.merge.size.per.task = 128000000")

Apart from these configurations, I tried repartition(1) and coalesce(1), which do merge everything into a single file, but the write deletes the Hive table and creates it again:

masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).orc(<HIVEtbl>)

If I use Append mode instead of Overwrite, it creates duplicate files under each partition:

masterFile.repartition(1).write.mode(SaveMode.Append).partitionBy(<partitioncolumn>).orc(<HIVEtbl>)

In both cases the Spark job runs twice and fails in the second execution. Is there any way I can use repartition/coalesce with Append mode without duplicating the part file in each partition?
Labels:
- Apache Hive
- Apache Spark