Merge small files in Spark while writing into Hive ORC table
- Labels: Apache Hive, Apache Spark
Created 09-15-2017 01:38 PM
I am reading a lot of CSV files from S3 via Spark and writing them into a Hive table as ORC. While writing, Spark produces a lot of small files. I need to merge all these files. I tried setting the property sqlContext.sql("set hive.merge.sparkfiles=true"), but this has no impact. The code is given below.
Please help.
sqlContext.sql("set hive.merge.smallfiles.avgsize=128000000")
sqlContext.sql("set hive.merge.size.per.task=128000000")
new_df.registerTempTable("new_df") sqlContext.sql("INSERT OVERWRITE TABLE db.tablename " +
Created 09-15-2017 05:12 PM
When you run jobs against HDFS, the data is by default written as splits so that it can be processed in parallel and the work is done efficiently. The split size depends on the block size and the MapReduce split size set in the job properties. When you write the data into Hive, are the resulting files smaller than the block size? If not, the small files shouldn't be a problem.
After writing to the Hive table, you can use the CONCATENATE option to merge the small files (see the sample command below).
Hope it helps!!
Use the below sample command:
ALTER TABLE tablename [PARTITION partition_spec] CONCATENATE;
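For instance, if the table were partitioned by week (the table and partition names here are only placeholders, not taken from this thread), the merge for a single partition might look like:
ALTER TABLE db.tablename PARTITION (week='2017-37') CONCATENATE;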
Created 09-15-2017 07:24 PM
If you want to write a single file, could you try repartitioning `new_df` before registering the temp table?
new_df.repartition(1).registerTempTable("new_df")...
Depending on your situation, you may choose a different number of partitions instead of 1.
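A minimal sketch of that approach, assuming the same new_df DataFrame and db.tablename table from the question (the original INSERT statement is truncated, so SELECT * is only a placeholder for the real column list):
// Repartition to a single partition so the insert writes one output file.
new_df.repartition(1).registerTempTable("new_df")
// Placeholder projection; substitute the actual SELECT from the original job.
sqlContext.sql("INSERT OVERWRITE TABLE db.tablename SELECT * FROM new_df")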
Created 09-16-2017 06:31 AM
By default, Spark creates 200 partitions when performing aggregations, as defined by the configuration variable "spark.sql.shuffle.partitions" (default value 200). This is the reason you will find a lot of small files in the Hive table location after each insert into a Hive table using Spark.
You can use the "coalesce" statement below to avoid writing many files into Hive. (coalesce does not perform a full shuffle, whereas repartition shuffles all of the data across the network.)
new_df.coalesce(2).write.mode("append").partitionBy("week").insertInto("db.tablename")
Note: a higher partition count increases the amount of parallelism in Spark, which is useful when the workload is large.
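An alternative sketch, not mentioned in the reply above, is to lower the shuffle partition count globally before the insert so that any shuffle feeding the write produces fewer output files; the value 10 is only illustrative and should be tuned to your data volume:
// Assumption: fewer shuffle partitions means fewer files from shuffled writes.
sqlContext.sql("set spark.sql.shuffle.partitions=10")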
Created 09-18-2017 05:08 AM
Thank you for your replies. I am already following the methods described above. I was wondering if there is a way to use the property hive.merge.sparkfiles=true, which takes care of combining small files automatically.