01-02-2018 08:36 AM
@Matt Andruff
Thanks for responding. The question is not about the difference between SaveMode.Append and SaveMode.Overwrite; I am concerned about the Spark job failure that happens due to multiple executions of the same task while using repartition/coalesce.
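
[Editor's note, not part of the original reply: duplicate attempts of the same task are often caused by speculative execution, which re-launches slow tasks and can leave duplicate output behind. A minimal sketch of disabling it when building the session; the object and app names are placeholders, not from the thread:

// Hypothetical sketch: turn off speculative execution so a slow task is not
// launched a second time (spark.speculation defaults to false, set here explicitly).
import org.apache.spark.sql.SparkSession

object NoSpeculation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("no-speculation")            // placeholder app name
      .config("spark.speculation", "false") // do not launch duplicate task attempts
      .enableHiveSupport()
      .getOrCreate()
  }
}
]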
12-29-2017 06:40 AM
I am reading CSV files from S3 and writing them into a Hive table as ORC. The write produces a lot of small files, which I need to merge. I have the following properties set:

spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("SET hive.merge.smallfiles.avgsize = 128000000")
spark.sql("SET hive.merge.size.per.task = 128000000")

Apart from these configurations, I tried repartition(1) and coalesce(1), which do merge everything into a single file, but the write deletes the Hive table and creates it again:

masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).orc(<HIVEtbl>)

If I use Append mode instead of Overwrite, it creates duplicate files under each partition:

masterFile.repartition(1).write.mode(SaveMode.Append).partitionBy(<partitioncolumn>).orc(<HIVEtbl>)

In both cases the Spark job runs twice and fails in the second execution. Is there any way I can use repartition/coalesce with Append mode without duplicating the part file in each partition?
Labels:
- Apache Hive
- Apache Spark