
Parallelize Spark Driver work (broadcasting / serialization)



Hi community,

I've got a Spark 2.3 application in which I need to broadcast rather large (1-3 GB) objects. To do so, I collect the Datasets on the driver and broadcast the results.

Performance measurements show that the driver spends a long time serializing the objects, which are indeed fairly complex. While it is broadcasting / serializing, the driver does nothing else.

I am wondering how to reduce this waiting time. Is there a way to parallelize tasks such as broadcasting / serialization on the driver?

For instance, it would be helpful to perform multiple broadcasts in parallel, to continue with other driver code while a broadcast is running, or to parallelize an individual broadcast.
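
A minimal sketch of the first two ideas, written against the plain JVM standard library rather than Spark so that it runs on its own. The names `ParallelDriverWork`, `serialize` and `buildObject` are hypothetical, and `serialize` is only a stand-in for the expensive driver-side step; in a real driver the task body would be the broadcast call itself, under the assumption that issuing broadcasts from several driver threads is safe in the Spark version in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelDriverWork {

    // Hypothetical stand-in for the expensive driver-side step.
    // It serializes the value with plain Java serialization.
    static byte[] serialize(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper that builds a serializable object of a given size.
    static ArrayList<Integer> buildObject(int size) {
        ArrayList<Integer> list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        // One thread per expensive task; the main thread stays free.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            ArrayList<Integer> first = buildObject(100_000);
            ArrayList<Integer> second = buildObject(200_000);

            // Start both serializations without waiting for either one.
            CompletableFuture<byte[]> a =
                CompletableFuture.supplyAsync(() -> serialize(first), pool);
            CompletableFuture<byte[]> b =
                CompletableFuture.supplyAsync(() -> serialize(second), pool);

            // Other driver code can run here while the tasks are in flight.
            int otherWork = first.size() + second.size();

            // Block only at the point where the results are actually needed.
            byte[] bytesA = a.join();
            byte[] bytesB = b.join();

            System.out.println("other work result: " + otherWork);
            System.out.println("serialized sizes: "
                + bytesA.length + ", " + bytesB.length);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only overlaps independent tasks; it does not split the serialization of a single object across threads, which is the third idea above and is not covered by the sketch.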

Best,

Benjamin