Created 05-29-2018 02:06 PM
Hello, I am struggling to find suitable APIs to process multiple DataFrames in parallel. My requirement is the following:
I have tens of distinct Spark DataFrames. A certain set of operations must be performed on each DF (treating each as a single partition), and some results must be returned from each. Ex:
Apply func1, func2, and func3 to DF1, DF2, and DF3 respectively, returning list1, list2, and list3.
In theory, func1, func2, and func3 can run in parallel. Wondering if there is any PySpark pattern I can follow.
Thanks!
Created 05-29-2018 07:20 PM
No responses? Does that mean it is not possible, or is there something very obvious that I am missing? 🙂
Created 05-30-2018 02:31 AM
There is no simple out-of-the-box solution for this, AFAIK. Have you considered using multiple threads on the driver side to do this? There are discussion threads on using Future to do just that; perhaps this could help you.
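As an illustration of that idea, here is a minimal sketch using Python's concurrent.futures on the driver. All of the DataFrames and the func1/func2/func3 bodies below are placeholders standing in for whatever your real per-DataFrame work is:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-dfs").getOrCreate()

# Placeholder DataFrames; substitute your own.
df1 = spark.range(100)
df2 = spark.range(200)
df3 = spark.range(300)

def func1(df):
    # Each function triggers its own Spark action; actions submitted
    # from different driver threads become separate concurrent jobs.
    return df.filter(df.id % 2 == 0).collect()

def func2(df):
    return df.groupBy().count().collect()

def func3(df):
    return df.select("id").limit(10).collect()

pairs = [(func1, df1), (func2, df2), (func3, df3)]

# submit() returns a Future; result() blocks until that job finishes.
with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
    futures = [pool.submit(f, df) for f, df in pairs]
    list1, list2, list3 = [fut.result() for fut in futures]
```

Keep in mind this only parallelizes job submission on the driver; the work itself still runs on the executors, and how concurrent jobs share cluster resources is governed by the Spark scheduler (the FAIR scheduler is often used for this).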
HTH
Created 05-30-2018 06:22 PM
@Developer Developer As @Felix Albani suggested above, I'd go with spawning multiple threads to process the DataFrames in parallel. This article has a good example: https://hadoopist.wordpress.com/2017/02/03/how-to-use-threads-in-spark-job-to-achieve-parallel-read-...
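For reference, the thread-based pattern described there looks roughly like the following sketch (the DataFrames and the process_df function are illustrative placeholders, not code taken from the article):

```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("threaded-dfs").getOrCreate()

def process_df(df, results, key):
    # Illustrative per-DataFrame work; replace with your real logic.
    results[key] = df.count()

# Placeholder DataFrames standing in for the tens of real ones.
dfs = {name: spark.range(1000) for name in ("df1", "df2", "df3")}

results = {}
threads = [
    threading.Thread(target=process_df, args=(df, results, name))
    for name, df in dfs.items()
]

for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # e.g. {'df1': 1000, 'df2': 1000, 'df3': 1000}
```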