
How to run spark df.write inside UDF called in rdd.foreach or rdd.foreachpartition



In other words, how can I use the SparkSession object inside an executor?

1 ACCEPTED SOLUTION

Contributor

Hello @Jack_sparrow

Glad to see you on the Community. 

As far as I know, df.write cannot be called inside rdd.foreach or rdd.foreachPartition.
df.write is a driver-side operation: it triggers a Spark job, and only the driver can submit jobs.
The functions you pass to rdd.foreach or rdd.foreachPartition run on the executors, where the SparkSession (and with it the DataFrame API) is not available.
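
As a rough illustration, here is a minimal PySpark sketch of the pattern that fails (the DataFrame, column names and output path are made-up placeholders, not taken from your code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # placeholder DataFrame with a single "id" column

def save_partition(rows):
    # This function is shipped to the executors. There is no usable
    # SparkSession there, so building a DataFrame and calling .write
    # cannot work: PySpark refuses to even serialize the session into
    # the task (an error along the lines of "SparkContext can only be
    # used on the driver"), and in Scala the same pattern typically
    # surfaces as a NullPointerException.
    spark.createDataFrame([(r.id,) for r in rows], ["id"]) \
        .write.parquet("/tmp/out")  # placeholder path

df.rdd.foreachPartition(save_partition)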

Check these references: 
https://stackoverflow.com/questions/46964250/nullpointerexception-creating-dataset-dataframe-inside-...
https://sparkbyexamples.com/spark/spark-foreachpartition-vs-foreach-explained

The option that should work for your case is a partitioned write done from the driver:

df.write.partitionBy

Something like this: 

df.write.partitionBy("someColumn").parquet("/path/out")
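
For a fuller picture, here is a minimal PySpark sketch (the column name, sample data and output path below are placeholders, assuming a plain Parquet write):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder data standing in for your real DataFrame.
df = spark.createDataFrame(
    [(1, "2024-01-01", "a"), (2, "2024-01-02", "b")],
    ["id", "someColumn", "payload"],
)

# Runs on the driver: Spark creates one sub-directory per distinct value
# of "someColumn" (e.g. /path/out/someColumn=2024-01-01/), and the executors
# write their partitions in parallel; no SparkSession is needed on them.
df.write.mode("overwrite").partitionBy("someColumn").parquet("/path/out")

That way the write is still coordinated by the driver, but the output ends up split by key, which is usually what a per-partition df.write inside foreachPartition is trying to achieve.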

 


Regards,
Andrés Fallas


3 REPLIES

Community Manager

@Jack_sparrow, Welcome to our community! To help you get the best possible answer, I have tagged in our Spark experts @haridjh and @vafs, who may be able to assist you further.

Please feel free to provide any additional information or details about your query. We hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager





Thank you for the response.