Created 09-15-2017 01:38 PM
I am reading a lot of CSV files from S3 via Spark and writing them into a Hive table as ORC. The job is producing a lot of small files. I need to merge all these files; I tried setting the property sqlContext.sql("set hive.merge.sparkfiles=true"), but it has no impact. The code is given below.
Please help.
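A minimal sketch of the pattern described, with placeholder path and table names (s3a://bucket/csv-input/ and db.tablename are not the real ones):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-orc")
         .enableHiveSupport()
         .getOrCreate())

# The property from the question; setting it here has no visible effect.
spark.sql("set hive.merge.sparkfiles=true")

# Hypothetical S3 path; each task writes its own ORC file,
# which is where the many small files come from.
df = spark.read.csv("s3a://bucket/csv-input/", header=True)
df.write.mode("append").format("orc").saveAsTable("db.tablename")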
Created 09-15-2017 05:12 PM
Jobs writing to HDFS create their output as splits by default so that the data can be processed in parallel; the split size depends on the block size and the MapReduce split-size properties. When writing the data into Hive, are the files being created smaller than the block size? If not, this shouldn't be a problem.
After writing to the Hive table, you can use Hive's CONCATENATE option to merge the small files. Use the sample command below:

ALTER TABLE tablename [PARTITION partition_spec] CONCATENATE;

Hope it helps!!
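For example, a hedged sketch of running this for one partition through HiveServer2 with PyHive (the host, table, and partition values are hypothetical; Spark SQL generally does not accept CONCATENATE, so it is issued through Hive here):

from pyhive import hive

# Hypothetical HiveServer2 endpoint and credentials.
conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="hive", database="default")
cursor = conn.cursor()

# Merge the small ORC files of one partition in place.
cursor.execute("ALTER TABLE db.tablename PARTITION (week='2017-37') CONCATENATE")

cursor.close()
conn.close()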
Created 09-15-2017 07:24 PM
If you want to write a single file, could you try repartitioning `new_df` before registering the temp table?

new_df.repartition(1).registerTempTable("new_df")...

Depending on your situation, you may choose a different number of partitions instead of 1.
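A minimal sketch of this suggestion, again with placeholder names (registerTempTable is the Spark 1.x name; createOrReplaceTempView is the Spark 2.x equivalent used here):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("single-file-insert")
         .enableHiveSupport()
         .getOrCreate())

new_df = spark.read.csv("s3a://bucket/csv-input/", header=True)

# repartition(1) forces a full shuffle into a single partition,
# so the insert below writes exactly one file.
new_df.repartition(1).createOrReplaceTempView("new_df")
spark.sql("INSERT INTO TABLE db.tablename SELECT * FROM new_df")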
Created 09-16-2017 06:31 AM
By default, Spark creates 200 partitions when performing aggregations, as defined by the conf variable "spark.sql.shuffle.partitions" (default value 200). This is why you find a lot of small files in the Hive table location after each insert into the Hive table from Spark.
You can use a coalesce() call like the one below to avoid writing many files into Hive. (coalesce() avoids a full shuffle, whereas repartition() shuffles the data across the network.)

new_df.coalesce(2).write.mode("append").insertInto("db.tablename")

(insertInto() writes into the table's existing partitions, here by "week", so partitionBy("week") is not needed; newer Spark versions reject that combination.)
Note: the high default is meant to increase the amount of parallelism in Spark, which is useful for large workloads.
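A runnable sketch of this approach, assuming a Hive-enabled SparkSession and placeholder names (db.tablename is a table partitioned by week):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("coalesce-before-insert")
         .enableHiveSupport()
         .getOrCreate())

# Lower the shuffle parallelism if 200 output partitions is too many.
spark.conf.set("spark.sql.shuffle.partitions", "50")

new_df = spark.read.csv("s3a://bucket/csv-input/", header=True)

# coalesce(2) narrows to 2 partitions without a full shuffle, so at
# most 2 files land in each week partition; insertInto() resolves the
# "week" partition column by position from the DataFrame itself.
new_df.coalesce(2).write.mode("append").insertInto("db.tablename")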
Created 09-18-2017 05:08 AM
Thank you for your replies. I am already following the methods described above. I was wondering whether there is a way to use the property hive.merge.sparkfiles=true, which takes care of combining small files automatically.