<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark optimum solution in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-optimum-solution/m-p/412368#M253408</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/131645"&gt;@Jack_sparrow&lt;/a&gt;&lt;/P&gt;&lt;P&gt;That should be possible.&lt;BR /&gt;You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader.&lt;/P&gt;&lt;P&gt;First, read the source table with "spark.read.table()". Since emp1 is a Hive-partitioned table, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available, and it builds a logical plan to read the data.&lt;/P&gt;&lt;P&gt;Next, repartition the data. To end up with exactly 10 output partitions and to control the parallelism of the write, use the "repartition(10)" method. This shuffles the data into 10 new partitions, which will be written by 10 separate tasks.&lt;/P&gt;&lt;P&gt;Finally, write the table with "write.saveAsTable()", specifying the format with ".format("parquet")".&lt;/P&gt;</description>
    <pubDate>Tue, 16 Sep 2025 04:44:49 GMT</pubDate>
    <dc:creator>vafs</dc:creator>
    <dc:date>2025-09-16T04:44:49Z</dc:date>
    <item>
      <title>Spark optimum solution</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-optimum-solution/m-p/412342#M253391</link>
      <description>&lt;P&gt;I have a Hive table emp1 with 100 partitions in Text format.&lt;/P&gt;&lt;P&gt;I want Spark to read emp1 on a partition basis and write it to emp2 in Parquet format. How can I read emp1 across 10 partitions and write to emp2 in parallel, making sure multiple executors are running and processing the data in parallel so the conversion finishes in a short time? I don't want to use HDFS paths.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Sep 2025 12:03:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-optimum-solution/m-p/412342#M253391</guid>
      <dc:creator>Jack_sparrow</dc:creator>
      <dc:date>2025-09-12T12:03:40Z</dc:date>
    </item>
    <item>
      <title>Re: Spark optimum solution</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-optimum-solution/m-p/412368#M253408</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/131645"&gt;@Jack_sparrow&lt;/a&gt;&lt;/P&gt;&lt;P&gt;That should be possible.&lt;BR /&gt;You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader.&lt;/P&gt;&lt;P&gt;First, read the source table with "spark.read.table()". Since emp1 is a Hive-partitioned table, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available, and it builds a logical plan to read the data.&lt;/P&gt;&lt;P&gt;Next, repartition the data. To end up with exactly 10 output partitions and to control the parallelism of the write, use the "repartition(10)" method. This shuffles the data into 10 new partitions, which will be written by 10 separate tasks.&lt;/P&gt;&lt;P&gt;Finally, write the table with "write.saveAsTable()", specifying the format with ".format("parquet")".&lt;/P&gt;
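&lt;P&gt;Putting it together, here is a minimal PySpark sketch. It assumes the table names emp1 and emp2 from your question, the default database, and that emp2 may be overwritten; adjust the names and write mode for your environment.&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import SparkSession

# Sketch only: emp1/emp2 and the overwrite mode are assumptions taken from the question.
spark = (SparkSession.builder
         .appName("emp1-to-emp2-parquet")
         .enableHiveSupport()   # needed so Spark can see the Hive metastore tables
         .getOrCreate())

# Read the Hive-partitioned source table; Spark discovers all 100 partitions.
df = spark.read.table("emp1")

# Shuffle into exactly 10 partitions so the write runs as 10 parallel tasks,
# then save the result as a Parquet-backed Hive table.
(df.repartition(10)
   .write
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("emp2"))&lt;/PRE&gt;&lt;P&gt;If the extra shuffle from "repartition(10)" is a concern, "coalesce(10)" also reduces the partition count without a shuffle, but it can lower the parallelism of the upstream read, so "repartition(10)" is usually the safer choice here.&lt;/P&gt;</description>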
      <pubDate>Tue, 16 Sep 2025 04:44:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-optimum-solution/m-p/412368#M253408</guid>
      <dc:creator>vafs</dc:creator>
      <dc:date>2025-09-16T04:44:49Z</dc:date>
    </item>
  </channel>
</rss>

