I often see people ask: what optimization techniques do you use for your Spark jobs? What techniques can we actually use, whether while writing the Spark job code, while submitting the job, or to run it with optimal resources?
There are a number of optimizations that can make your Spark jobs run faster. I'm just listing some of them here, with a short code sketch after the list; you can read more about each in the Spark documentation.
.repartition and .coalesce (to increase or decrease parallelism; repartition does a full shuffle, while coalesce avoids one when you are only reducing the number of partitions)
.cache (to keep an RDD in memory so it is not re-evaluated from the base dataset every time it is used; use Kryo serialization for a more compact serialized form)
.broadcast (only for read-only objects, not for RDDs) - to send a copy of large dependent data (a lookup list/map/text) to each executor, which caches it locally, so the driver doesn't have to send the same data over the network with every task.
checkpointing (mainly for fault tolerance, though saving to reliable storage also truncates long RDD lineages in iterative jobs)
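Here is a minimal Scala sketch that ties these together in one plain RDD job. The input path, checkpoint directory, partition counts and the country-code lookup map are made-up placeholders; treat it as an illustration of where each call goes, not as tuned values.

import org.apache.spark.{SparkConf, SparkContext}

object SparkOptimizationSketch {
  def main(args: Array[String]): Unit = {
    // Kryo gives a more compact serialized form than the default Java serialization
    val conf = new SparkConf()
      .setAppName("optimization-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // the checkpoint directory must be set before calling .checkpoint (placeholder path)
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    // placeholder input path
    val lines = sc.textFile("hdfs:///data/events.csv")

    // .repartition does a full shuffle to raise parallelism;
    // .coalesce cheaply reduces the partition count without a shuffle
    val wide   = lines.repartition(200)
    val narrow = wide.coalesce(50)

    // .cache keeps the parsed RDD in memory so later actions don't re-read the file
    val events = narrow.map(_.split(",")).cache()

    // .broadcast ships a read-only lookup map to every executor once,
    // instead of sending it along with each task (made-up lookup data)
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))
    val byCountry = events.map(cols => (countryNames.value.getOrElse(cols(0), "unknown"), 1L))

    // checkpointing saves to reliable storage and truncates the lineage (mainly fault tolerance)
    byCountry.checkpoint()

    byCountry.reduceByKey(_ + _).take(10).foreach(println)
    sc.stop()
  }
}

You would submit this like any other job with spark-submit; the partition counts and the executor memory/cores you pass at submit time are the parts that need cluster-specific tuning.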