
Process multiple dataframes in parallel using pyspark 2.1

Hello, I am struggling to find suitable APIs to process multiple DataFrames in parallel. My requirement is as follows:

I have tens of distinct Spark DataFrames. A certain set of operations must be performed on each DF (treating each as a single partition), and some results must be returned from each. For example:

Apply func1, func2, and func3 to DF1, DF2, and DF3, returning list1, list2, and list3 respectively.

So, in theory, func1, func2, and func3 can run in parallel. I am wondering if there is a PySpark pattern I can follow.

Thanks !

3 REPLIES

No responses? Does that mean it is not possible, or is there something very obvious that I am missing? 🙂

@Developer Developer

There is no simple out-of-the-box solution to this, AFAIK. Have you considered using multiple threads on the driver side to do this? The following threads discuss using Futures to do just that; perhaps this could help you.

https://stackoverflow.com/questions/31912858/processing-multiple-files-as-independent-rdds-in-parall...

https://stackoverflow.com/questions/46981424/how-to-process-multiple-dataframes-concurrently-in-spar...
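The pattern from those threads can be sketched with `concurrent.futures`. This is a minimal, hypothetical example: the plain Python lists and `func1`/`func2`/`func3` below stand in for your real Spark DataFrames and per-DF operations; in an actual job, each function would run its transformations and a Spark action (e.g. `collect()`), and Spark schedules jobs submitted from separate driver threads independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for func1/func2/func3. In a real PySpark job each
# would apply transformations to its DataFrame and end with an action such
# as df.collect(), returning a list of results.
def func1(df): return [x * 2 for x in df]
def func2(df): return [x + 1 for x in df]
def func3(df): return [x ** 2 for x in df]

# Hypothetical stand-ins for DF1, DF2, DF3.
df1, df2, df3 = [1, 2], [3, 4], [5, 6]

def run_in_parallel(tasks):
    """Submit each (func, df) pair to its own driver-side thread and
    collect the results in the order the tasks were given."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(f, df) for f, df in tasks]
        return [fut.result() for fut in futures]

list1, list2, list3 = run_in_parallel(
    [(func1, df1), (func2, df2), (func3, df3)]
)
```

Note that threads (not processes) are the right fit here: the heavy lifting happens on the executors, so the driver threads spend most of their time waiting on Spark, and they all share the same SparkContext.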

HTH

*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.

@Developer Developer As @Felix Albani suggested above, I'd go with spawning multiple threads to process the DataFrames in parallel. This article has a good example: https://hadoopist.wordpress.com/2017/02/03/how-to-use-threads-in-spark-job-to-achieve-parallel-read-...
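The raw-`threading` variant used in articles like the one linked looks roughly like the sketch below. The lists, function names, and result keys are hypothetical stand-ins; in a real job each `fn` would trigger Spark actions (`collect`, `count`, ...) on its DataFrame.

```python
import threading

# Hypothetical per-DataFrame operations standing in for the real funcs.
def double_all(df):
    return [x * 2 for x in df]

def sum_only(df):
    return [sum(df)]

# Hypothetical (result_name, dataframe, function) jobs.
jobs = [("list1", [1, 2, 3], double_all),
        ("list2", [4, 5, 6], sum_only)]

results = {}

def worker(name, df, fn):
    # Each thread writes to its own key, so no lock is needed here.
    results[name] = fn(df)

threads = [threading.Thread(target=worker, args=job) for job in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After `join()` returns, `results` holds one entry per DataFrame. One caveat worth knowing: with the default FIFO scheduler, concurrent jobs still queue behind each other for cluster resources, so setting `spark.scheduler.mode` to `FAIR` is often recommended when submitting jobs from multiple threads.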