
Spark inserts into a Hive external table create very small part files. Is there a way other than repartition (which slows processing) to combine all the 1 MB files into fewer, larger files?



spark.sql("set pyspark.hadoop.hive.exec.dynamic.partition=true")

spark.sql("set pyspark.hadoop.hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("set hive.exec.dynamic.partition=true")

spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("set hive.merge.tezfiles=true")

spark.sql("SET hive.merge.sparkfiles = true")

spark.sql("set hive.merge.smallfiles.avgsize=128000000")

spark.sql("set hive.merge.size.per.task=128000000")


Re: Spark inserts into a Hive external table create very small part files. Is there a way other than repartition (which slows processing) to combine all the 1 MB files into fewer, larger files?


@manohar ghanta

Option-1:

You can call .coalesce(n) on your dataframe, which merges partitions down to n without a full shuffle, and then use .option("maxRecordsPerFile", m) on the writer to cap the number of records written to each file.
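A minimal PySpark sketch of Option-1. The helper names, the output path, the ORC format and the average row size are assumptions for illustration, not part of the original question; the writer here saves to a file location rather than inserting through the Hive table.

```python
def records_per_file(target_file_bytes, avg_row_bytes):
    """Rough row cap so one file stays near target_file_bytes.

    avg_row_bytes is an assumed on-disk size per row; measure it on your data.
    """
    return max(1, target_file_bytes // avg_row_bytes)


def write_coalesced(df, path, num_files, max_records, fmt="orc"):
    """Merge partitions without a full shuffle, cap rows per file, then append."""
    (df.coalesce(num_files)                          # narrow merge, no full shuffle
       .write
       .option("maxRecordsPerFile", max_records)     # upper bound on rows per file
       .mode("append")
       .format(fmt)
       .save(path))                                  # hypothetical table location
```

For example, write_coalesced(df, "/path/to/table", 8, records_per_file(128 * 1024 * 1024, 1024)) asks for at most 8 write tasks. Note that maxRecordsPerFile only splits files that would grow too large; it never merges small ones, so coalesce(n) is what actually reduces the file count.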

Option-2:

Set spark.sql.shuffle.partitions=n. This option controls how many partitions each shuffle produces (not how many shuffles happen).

Then df.sort("<col_name>").write... triggers a shuffle before the write, so the job writes one file per shuffle partition, i.e. up to the number you set for spark.sql.shuffle.partitions.
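Option-2 can be sketched as follows. The function name, path and format are assumptions; spark is an existing SparkSession and df the dataframe to write.

```python
def write_sorted(spark, df, sort_col, path, num_files, fmt="orc"):
    """Let the sort's shuffle decide how many files get written."""
    # Number of partitions produced by each shuffle in this session.
    spark.conf.set("spark.sql.shuffle.partitions", str(num_files))
    # sort() shuffles the rows into range partitions; each non-empty
    # partition is then written by one task.
    df.sort(sort_col).write.mode("append").format(fmt).save(path)
```

On Spark 3.x with adaptive query execution enabled, the engine may merge small shuffle partitions, so the final file count can be lower than num_files.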

Option-3:

Hive:

Once the Spark job is done, trigger a Hive job that runs insert overwrite on the same table, selecting from it with sort by, distribute by, or cluster by, and set all the Hive configurations that you mentioned in the question.

insert overwrite table <table_name> select * from <table_name> distribute by <col2> sort by <col1>;
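As a sketch, the Hive script for this option could be generated from Python and then submitted through whatever Hive client you use (beeline, for instance). The function name and the 128000000-byte default are assumptions; the merge settings are the ones from the question.

```python
def hive_compaction_script(table, distribute_col, sort_col, target_bytes=128000000):
    """Build the HiveQL that rewrites a table into larger files."""
    settings = [
        "set hive.merge.tezfiles=true",
        "set hive.merge.smallfiles.avgsize=%d" % target_bytes,
        "set hive.merge.size.per.task=%d" % target_bytes,
    ]
    # In HiveQL, DISTRIBUTE BY comes before SORT BY.
    rewrite = (
        "insert overwrite table {t} select * from {t} "
        "distribute by {d} sort by {s}"
    ).format(t=table, d=distribute_col, s=sort_col)
    return ";\n".join(settings + [rewrite]) + ";"
```

For a partitioned table the insert may also need a partition clause and the dynamic-partition settings; those are left out here to keep the sketch short.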

Option-4:

Hive:

If you have an ORC table, schedule a concatenate job to run periodically:

alter table <table_name> concatenate;
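For a partitioned ORC table, concatenate is issued per partition. A small sketch that builds those statements for a scheduled job; the function name and the example table and partition values are hypothetical.

```python
def concatenate_statements(table, partitions):
    """Return one ALTER TABLE ... CONCATENATE statement per partition.

    partitions is a list of dicts mapping partition column -> value,
    e.g. [{"dt": "2020-01-01"}]. Values are quoted as strings.
    """
    statements = []
    for spec in partitions:
        clause = ", ".join("{}='{}'".format(k, v) for k, v in spec.items())
        statements.append(
            "alter table {} partition ({}) concatenate;".format(table, clause)
        )
    return statements
```

The scheduler can then pass each returned statement to Hive, typically only for partitions that recently received new small files.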

If none of these methods is feasible, .repartition(n) is the way to go. It adds the overhead of a full shuffle, but you end up with roughly evenly sized files in HDFS, which speeds up reading those files from Hive or Spark.
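One way to pick n for .repartition(n) is to divide the expected output size by a target file size. A minimal sketch; the helper names, the 128 MiB target, the path and the format are assumptions, and total_bytes is an estimate you supply yourself.

```python
import math


def partitions_for(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of partitions so each output file is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_repartitioned(df, path, total_bytes, fmt="orc"):
    """Full shuffle into n even partitions, then one file per partition."""
    n = partitions_for(total_bytes)
    df.repartition(n).write.mode("append").format(fmt).save(path)
    return n
```

With an estimated 1 GiB of output, partitions_for(1024 ** 3) gives 8 partitions of about 128 MiB each.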
