Member since: 09-08-2025
Posts: 5
Kudos Received: 1
Solutions: 0
09-15-2025
09:44 PM
Hello @Jack_sparrow That should be possible. You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader.

First, read the source table with spark.read.table(). Since the table is Hive partitioned, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available, and it builds a logical plan for the read.

Next, repartition the data. To get exactly 10 output partitions and to control the parallelism of the write, call repartition(10). This shuffles the data into 10 new partitions, which are then processed by 10 separate tasks.

Finally, write the table with write.saveAsTable(), specifying the format with .format("parquet").
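For reference, a minimal PySpark sketch of those three steps; the database and table names below are placeholders, not your actual objects:

# Minimal sketch, assuming a Hive metastore is configured;
# "source_db.big_table" and "target_db.big_table_copy" are placeholder names.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("copy-hive-partitioned-table")
    .enableHiveSupport()          # lets spark.read.table() resolve Hive tables
    .getOrCreate()
)

# Step 1: read the partitioned source table; Spark discovers all partitions itself.
df = spark.read.table("source_db.big_table")

# Step 2: shuffle into exactly 10 partitions so the write runs as 10 tasks.
df_out = df.repartition(10)

# Step 3: write the result as a Parquet-backed table.
(
    df_out.write
    .format("parquet")
    .mode("overwrite")            # assumption: the target table can be overwritten
    .saveAsTable("target_db.big_table_copy")
)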
09-11-2025
11:04 AM
Hello @Jack_sparrow, Spark should handle this automatically, and you can control it with these settings:

Input splits are controlled by spark.sql.files.maxPartitionBytes (default 128 MB). Lowering it produces more splits, and therefore more parallel read tasks.

spark.sql.files.openCostInBytes (default 4 MB) influences how aggressively Spark coalesces small files into a single split.

Shuffle parallelism is set by spark.sql.shuffle.partitions (default 200). A common rule of thumb is roughly 2–3 times the total number of executor cores.

Also make sure df.write.parquet() doesn't collapse everything into only a few output files. If it does, use .repartition(n) to increase the parallelism before writing, as in the sketch below.
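A quick sketch of how those settings might look in PySpark; the values, table name, and output path are illustrative assumptions, not tuned recommendations:

# Sketch only; adjust the values to your cluster. Table name and path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tune-read-write-parallelism")
    .enableHiveSupport()
    .getOrCreate()
)

# Smaller max split size -> more input splits -> more parallel read tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))   # 64 MB

# Estimated cost of opening a file; affects how small files get coalesced into splits.
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB (default)

# Shuffle parallelism: roughly 2-3x total executor cores (e.g. 40 cores -> ~100).
spark.conf.set("spark.sql.shuffle.partitions", "100")

df = spark.read.table("source_db.big_table")   # hypothetical source table

# Repartition before writing so the output isn't collapsed into a few large files.
df.repartition(50).write.mode("overwrite").parquet("/tmp/big_table_parquet")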
09-08-2025
09:01 PM
1 Kudo
Thank you for the response.