
Process multiple dataframes in parallel using pyspark 2.1

Hello, I am struggling to find suitable APIs to process multiple DataFrames in parallel. My requirement is as follows:

I have tens of distinct Spark DataFrames. A certain set of operations must be performed on each DF (treating each as a single partition), and some results must be returned from each. For example:

Apply func1, func2, and func3 to DF1, DF2, and DF3, returning list1, list2, and list3 respectively.

So, in theory, func1, func2, and func3 can run in parallel. I am wondering if there is a PySpark pattern I can follow.

Thanks !

3 REPLIES

No responses? Does that mean it is not possible, or is there something very obvious that I am missing? 🙂

@Developer Developer

There is no simple out-of-the-box solution to this, AFAIK. Have you considered using multiple threads on the driver side to do this? The following threads discuss using Futures to do just that; perhaps this could help you.

https://stackoverflow.com/questions/31912858/processing-multiple-files-as-independent-rdds-in-parall...

https://stackoverflow.com/questions/46981424/how-to-process-multiple-dataframes-concurrently-in-spar...
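The pattern from those threads can be sketched with `concurrent.futures`. This is a minimal, hypothetical example: the plain Python lists and `func1`/`func2`/`func3` below stand in for your real Spark DataFrames and per-DF operations; in an actual job, each function would run its transformations and a Spark action (e.g. `collect()`), and Spark schedules jobs submitted from separate driver threads independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for func1/func2/func3. In a real PySpark job each
# would apply transformations to its DataFrame and end with an action such
# as df.collect(), returning a list of results.
def func1(df): return [x * 2 for x in df]
def func2(df): return [x + 1 for x in df]
def func3(df): return [x ** 2 for x in df]

# Hypothetical stand-ins for DF1, DF2, DF3.
df1, df2, df3 = [1, 2], [3, 4], [5, 6]

def run_in_parallel(tasks):
    """Submit each (func, df) pair to its own driver-side thread and
    collect the results in the order the tasks were given."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(f, df) for f, df in tasks]
        return [fut.result() for fut in futures]

list1, list2, list3 = run_in_parallel(
    [(func1, df1), (func2, df2), (func3, df3)]
)
```

Note that threads (not processes) are the right fit here: the heavy lifting happens on the executors, so the driver threads spend most of their time waiting on Spark, and they all share the same SparkContext.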

HTH

*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.

@Developer Developer As @Felix Albani suggested above, I'd go with spawning multiple threads to process the DataFrames in parallel. This article has a good example: https://hadoopist.wordpress.com/2017/02/03/how-to-use-threads-in-spark-job-to-achieve-parallel-read-...
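The raw-`threading` variant used in articles like the one linked looks roughly like the sketch below. The lists, function names, and result keys are hypothetical stand-ins; in a real job each `fn` would trigger Spark actions (`collect`, `count`, ...) on its DataFrame.

```python
import threading

# Hypothetical per-DataFrame operations standing in for the real funcs.
def double_all(df):
    return [x * 2 for x in df]

def sum_only(df):
    return [sum(df)]

# Hypothetical (result_name, dataframe, function) jobs.
jobs = [("list1", [1, 2, 3], double_all),
        ("list2", [4, 5, 6], sum_only)]

results = {}

def worker(name, df, fn):
    # Each thread writes to its own key, so no lock is needed here.
    results[name] = fn(df)

threads = [threading.Thread(target=worker, args=job) for job in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After `join()` returns, `results` holds one entry per DataFrame. One caveat worth knowing: with the default FIFO scheduler, concurrent jobs still queue behind each other for cluster resources, so setting `spark.scheduler.mode` to `FAIR` is often recommended when submitting jobs from multiple threads.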