Support Questions
Find answers, ask questions, and share your expertise

Optimization techniques for spark jobs


I often see people asked what optimization techniques they use for their Spark jobs. What optimization techniques can we use for Spark jobs — while writing the job code, when submitting it, or to run the job with optimal resources?


Expert Contributor

There are several optimizations that can make your Spark jobs run faster. I'm just listing some here; you can read more about them in the documentation.

  1. .repartition and .coalesce — to increase or decrease parallelism (.coalesce avoids a full shuffle when you are only reducing the number of partitions)
  2. .cache — to keep an RDD in memory so it is not re-evaluated from the base dataset every time it is used (use Kryo serialization for compactness)
  3. .broadcast (only for objects, not for RDDs) — to send a read-only copy of large dependent data (a list/lookup table) to each executor, which caches it locally; the driver needn't ship the same data with every task.
  4. checkpointing — truncates a long lineage chain (though this is mainly for fault tolerance)
  5. .mapPartitions vs .map — to pay a per-partition setup cost (e.g. opening a connection) once per partition instead of once per record
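The techniques above can be sketched in one small Scala program. This is a minimal illustration, not production code: the input file `events.txt`, the checkpoint directory `/tmp/chk`, and the lookup map are placeholder assumptions, and `local[*]` is used only so the sketch is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("optimization-sketch")
      .setMaster("local[*]") // placeholder; normally set by spark-submit
      // 2. Kryo serialization: more compact than default Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/chk") // 4. must be set before calling .checkpoint

    val rdd = sc.textFile("events.txt") // hypothetical input

    // 1. repartition (full shuffle) to raise parallelism; coalesce avoids
    //    a shuffle when only reducing the number of partitions
    val wide   = rdd.repartition(200)
    val narrow = wide.coalesce(50)

    // 2. cache so the lineage is not recomputed on every action
    narrow.cache()

    // 3. broadcast a read-only lookup table once per executor instead of
    //    shipping it with every task
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val mapped = narrow.map(line => lookup.value.getOrElse(line, 0))

    // 4. checkpoint truncates the lineage (mainly for fault tolerance)
    mapped.checkpoint()

    // 5. mapPartitions: pay a setup cost (e.g. a DB connection) once per
    //    partition rather than once per record, as .map would
    val perPartition = mapped.mapPartitions { iter =>
      val resource = new StringBuilder // stand-in for a real connection
      iter.map(v => v + resource.length)
    }

    perPartition.count() // an action triggers the whole pipeline
    sc.stop()
  }
}
```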

