
Optimization techniques for Spark jobs

Explorer

I often see people being asked what optimization techniques they use for their Spark jobs. What optimization techniques can we use for Spark jobs, whether while writing the job code, when submitting it, or to run the job with optimal resources?

1 REPLY
Re: Optimization techniques for Spark jobs

Expert Contributor

There are several optimizations that can make your Spark jobs run faster. I'm just listing some here; you can read more about each of them in the documentation.

  1. .repartition and .coalesce - to increase or decrease parallelism (.repartition triggers a full shuffle; .coalesce can avoid one when reducing the number of partitions)
  2. .cache - to keep an RDD in memory so it is not re-evaluated from the base dataset on every action (use Kryo serialization for compactness)
  3. .broadcast (only for plain objects, not for RDDs) - to send one copy of large dependent data (a list/lookup table) to each executor, which caches it locally. The driver needn't send the same data over the network for every task.
  4. checkpointing - truncates the RDD lineage (though this is mainly for fault tolerance)
  5. .mapPartitions instead of .map - to pay per-partition setup costs (e.g. opening a connection) once per partition rather than once per record

Thanks
