
Spark optimal solution

I have a Hive table emp1 with 100 partitions, stored in Text format.

I want Spark to read emp1 on a per-partition basis and write the data to emp2 in Parquet format. How can I read emp1 as 10 partitions and write them to emp2 in parallel, making sure multiple executors process the data concurrently so the conversion completes quickly? I don't want to use HDFS paths.



Hello @Jack_sparrow 

That should be possible. 
You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader.

First, read the source table with "spark.read.table()". Since emp1 is a partitioned Hive table, Spark will automatically discover and read all 100 partitions in parallel, as long as enough executors and cores are available. Spark then builds a logical plan for the read.
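A minimal PySpark sketch of this step (assuming your tables are named emp1 and emp2 as in your post, and that the SparkSession is built with Hive support so it can resolve metastore tables):

    from pyspark.sql import SparkSession

    # Hive support lets Spark resolve metastore tables like emp1
    spark = (SparkSession.builder
             .appName("emp1-to-emp2")
             .enableHiveSupport()
             .getOrCreate())

    # Read the partitioned Hive table; Spark discovers all 100 partitions itself
    emp_df = spark.read.table("emp1")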

Next, repartition the data. To get exactly 10 output partitions and control the parallelism of the write, use the "repartition(10)" method. This shuffles the data into 10 new partitions, each processed by its own task.
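Continuing the sketch from above:

    # Shuffle into exactly 10 partitions; the write then runs as 10 parallel tasks
    emp_df_10 = emp_df.repartition(10)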

Finally, write the table with "write.saveAsTable()", specifying the output format via ".format("parquet")".
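And the last step of the sketch (the overwrite mode here is an assumption; drop or change it if emp2 must be appended to instead):

    # Write the 10 partitions out as a Parquet-backed Hive table emp2
    (emp_df_10.write
        .format("parquet")
        .mode("overwrite")   # assumption: emp2 may be (re)created on each run
        .saveAsTable("emp2"))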


Regards,
Andrés Fallas