Support Questions

devquestions2 · 05-29-2018

Hello, I am struggling to find suitable APIs to process multiple DataFrames in parallel. My requirement is as follows:

I have tens of distinct Spark DataFrames. A certain set of operations must be performed on each DF (treating each as a single partition), and some results must be returned from each one. For example:

Apply func1, func2, func3 to DF1, DF2 and DF3, return list1, list2 and list3 from each.

So, in theory, func1, func2 and func3 can be run in parallel. Is there a PySpark pattern I can follow?

Thanks!

devquestions2 · 05-29-2018

No responses? Does that mean it is not possible, or is there something very obvious that I am missing? 🙂

falbani · 05-30-2018

@Developer Developer

There is no simple out-of-the-box solution to this, AFAIK. Have you considered using multiple threads on the driver side to do this? The following threads discuss using Futures to do just that; perhaps this could help you.

https://stackoverflow.com/questions/31912858/processing-multiple-files-as-independent-rdds-in-parall...

https://stackoverflow.com/questions/46981424/how-to-process-multiple-dataframes-concurrently-in-spar...

HTH


sandyy006 · 05-30-2018

@Developer Developer As @Felix Albani suggested above, I'd go with spawning multiple threads to process the DataFrames in parallel. This article has a good example: https://hadoopist.wordpress.com/2017/02/03/how-to-use-threads-in-spark-job-to-achieve-parallel-read-...
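
For reference, the driver-side threading approach suggested in the replies could look roughly like the sketch below, using Python's standard `concurrent.futures` as the counterpart of the Futures mentioned in the linked threads. The helper name `run_in_parallel`, the `jobs` dictionary layout, and the pairing of func1 with DF1, func2 with DF2, and so on are assumptions made for illustration, not something taken from the linked articles.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(jobs, max_workers=4):
    """Run each (func, df) pair on its own driver-side thread.

    `jobs` maps a label to a (function, DataFrame) pair. Each function is
    assumed to trigger its own Spark actions (e.g. collect()) and to return
    a plain Python list. `run_in_parallel` is a hypothetical helper name.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the Spark jobs have a chance to overlap.
        futures = {name: pool.submit(func, df) for name, (func, df) in jobs.items()}
        # result() blocks until that job finishes and re-raises its exception, if any.
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical usage, assuming df1..df3 are existing Spark DataFrames and
# func1..func3 are your own functions that each return a Python list:
#
#   results = run_in_parallel({
#       "list1": (func1, df1),
#       "list2": (func2, df2),
#       "list3": (func3, df3),
#   })
#   list1, list2, list3 = results["list1"], results["list2"], results["list3"]
```

Threads are usually sufficient here because each Python thread spends most of its time waiting on the JVM rather than holding the GIL. Whether the jobs actually run side by side depends on how many executor cores are free; with the default FIFO scheduling, an earlier job that occupies all cores will still delay the later ones.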
