Member since
09-08-2025
4
Posts
1
Kudos Received
0
Solutions
09-11-2025
11:04 AM
Hello @Jack_sparrow, Spark should do this automatically, and you can tune it with these settings. Input splits are controlled by spark.sql.files.maxPartitionBytes (default 128 MB): a smaller value produces more splits and therefore more parallel tasks. spark.sql.files.openCostInBytes (default 4 MB) influences how Spark packs small files together into one partition. Shuffle parallelism is controlled by spark.sql.shuffle.partitions (default 200); a common starting point is 2–3 times the total number of executor cores. Also, make sure df.write.parquet() doesn't write everything into only a few files. If it does, call .repartition(n) before writing to increase the parallelism.
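Here is a rough sketch of how those settings could be wired together in PySpark. The helper names (suggest_spark_conf, write_with_parallelism), the app name, the factor of 3, the 2x multiplier for repartition, and the assumption that the input is Parquet are all mine for illustration, so adjust them to your job.

```python
MB = 1024 * 1024


def suggest_spark_conf(total_executor_cores, factor=3,
                       max_partition_mb=128, open_cost_mb=4):
    """Build the three settings discussed above as a plain dict.

    `factor` is the "2-3 times total executor cores" rule of thumb;
    the MB arguments default to Spark's documented defaults.
    """
    return {
        "spark.sql.files.maxPartitionBytes": str(max_partition_mb * MB),
        "spark.sql.files.openCostInBytes": str(open_cost_mb * MB),
        "spark.sql.shuffle.partitions": str(total_executor_cores * factor),
    }


def write_with_parallelism(input_path, output_path, total_executor_cores):
    """Hypothetical job: read, repartition, then write Parquet."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("parallel-parquet-write")
    for key, value in suggest_spark_conf(total_executor_cores).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Assumption: the source data is Parquet; swap in the reader you need.
    df = spark.read.parquet(input_path)

    # Repartition before writing so the output is spread over more files/tasks.
    n = total_executor_cores * 2  # illustrative choice, not a Spark default
    df.repartition(n).write.mode("overwrite").parquet(output_path)


if __name__ == "__main__":
    for key, value in suggest_spark_conf(total_executor_cores=16).items():
        print(key, "=", value)
```

The same values can also be passed with --conf on spark-submit instead of being set in code.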
09-08-2025
09:01 PM
1 Kudo
Thank you for the response.