
Merge small files in spark while writing into hive orc table

Explorer

I am reading a lot of CSV files from S3 via Spark and writing them into a Hive table as ORC. While writing, it produces a lot of small files. I need to merge all these files. I tried setting the property sqlContext.sql("set hive.merge.sparkfiles=true"), but this has no impact. The code is given below.

Please help.

sqlContext.sql("set hive.merge.sparkfiles=true")
sqlContext.sql("set hive.merge.smallfiles.avgsize=128000000")
sqlContext.sql("set hive.merge.size.per.task=128000000")
new_df.registerTempTable("new_df")

sqlContext.sql("INSERT OVERWRITE TABLE db.tablename " +
  "PARTITION(`week`='W_1617') " +
  "SELECT col1, col2 FROM new_df")
4 Replies


Hi @Vijayalakshmi Ekambaram

If you run jobs against HDFS, the output is created as splits by default so that the data can be processed in parallel. How many splits you get depends on the block size and the mapred split size configured in the properties. When you write the data into Hive, is it creating files smaller than the block size? If not, the small files shouldn't be a problem.

After writing to the Hive table, you can use the CONCATENATE option to merge the small files.

Hope it helps!!

Use the below sample command:

ALTER TABLE tablename [PARTITION partition_spec] CONCATENATE; 
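For example, filling that in with the table and partition from your code above, the statement would look like the sketch below. CONCATENATE works for ORC tables, so it fits your case; if the statement is not accepted through sqlContext.sql in your Spark version, you can run it from the Hive shell or beeline instead.

ALTER TABLE db.tablename PARTITION (`week`='W_1617') CONCATENATE;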

Expert Contributor

If you want to write a single file, could you try repartitioning `new_df` before registering the temp table?

new_df.repartition(1).registerTempTable("new_df")...

Depending on your situation, you may choose a different number of partitions instead of 1.
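Putting that together with the code from your question, a sketch might look like this (1 means one output file for the insert; use a larger number if one file is too big):

// repartition so the subsequent insert writes that many files
new_df.repartition(1).registerTempTable("new_df")

sqlContext.sql("INSERT OVERWRITE TABLE db.tablename " +
  "PARTITION(`week`='W_1617') " +
  "SELECT col1, col2 FROM new_df")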

Explorer

Hi @Vijayalakshmi Ekambaram,

By default, Spark produces 200 partitions when performing aggregations, as defined by the config variable "spark.sql.shuffle.partitions" (default value 200). This is the reason you find a lot of small files under the Hive table location after each insert into the Hive table from Spark.

You can use the "coalesce" statement below to avoid writing many files into Hive. (coalesce does not perform a full shuffle, whereas repartition shuffles all the data across the network.)

new_df.coalesce(2).write.mode("append").partitionBy("week").insertInto("db.tablename")


Note: the high default is there to increase the amount of parallelism in Spark, which is useful when the workload is huge.
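If you would rather control this globally than per write, another sketch is to lower the shuffle partition setting mentioned above before running the insert; the value 10 here is only an illustration, pick it based on your data volume:

sqlContext.sql("set spark.sql.shuffle.partitions=10")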

Explorer

Thank you for your replies. I am already following the methods mentioned above. I was wondering if there is a way to use the property hive.merge.sparkfiles=true, which takes care of combining the small files automatically.