Hello @Jack_sparrow
That should be possible.
You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader.
First, read the source table with "spark.read.table()". Since the source is a Hive-partitioned table, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available. At this point Spark only builds a logical plan; the actual read happens when the write action is triggered.
Next, repartition the data. To end up with exactly 10 output partitions and to control the parallelism of the write, call "repartition(10)". This shuffles the data into 10 new partitions, which are then processed by 10 tasks.
Finally, write the table with "write.saveAsTable()". Specify the format with ".format("parquet")" so the table is stored as Parquet rather than your environment's default format.
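Here is a minimal PySpark sketch of the three steps above. The table names "source_db.source_table" and "target_db.target_table" and the overwrite mode are placeholders/assumptions; substitute your own names and adjust the save mode to what you need.

```python
from pyspark.sql import SparkSession

# Placeholder session setup; in a notebook or spark-submit job `spark` already exists.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Read the Hive-partitioned source table; Spark discovers the partitions itself.
df = spark.read.table("source_db.source_table")  # placeholder name

# 2. Shuffle the data into exactly 10 partitions so the write runs as 10 tasks.
df_repart = df.repartition(10)

# 3. Write the result as a Parquet table.
(df_repart.write
    .format("parquet")
    .mode("overwrite")  # assumption: replace the target table; use "append" to keep existing data
    .saveAsTable("target_db.target_table"))  # placeholder name
```

Note that repartition(10) triggers a full shuffle; if you only want to reduce the number of output files without a shuffle, coalesce(10) is a cheaper alternative, at the cost of less even task sizes.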
Regards,
Andrés Fallas
--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs-up button.