As the volume of data increases, Spark's performance is suffering. Can someone please suggest Spark performance tuning options or any other way to improve performance?
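1. Analyze your Spark code to find ways to optimize joins, groupBy, reduceByKey, and combineByKey operations.
2. Increase the parallelism for groupBy operations, for example by passing a larger partition count or raising spark.sql.shuffle.partitions.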
3. Try to use CombineFileInputFormat when the input consists of many small files.
4. Compress and save your data in a columnar format (ORC and Parquet are best suited).
5. If you have a sequence of jobs, try caching intermediate data in memory if possible.
6. Estimate the size of your data and, if possible, launch containers with sufficient RAM to hold it.
7. Dynamic allocation is best for most use cases.
8. Use coalesce instead of repartition when you only need to reduce the number of partitions, since coalesce avoids a full shuffle.
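
To make points 2, 6 and 7 a bit more concrete, here is a rough sketch of how you might turn an estimated input size into Spark settings. The helper name suggest_spark_conf and the 128 MB target partition size are my own assumptions (a common rule of thumb, not something Spark mandates), so treat the numbers as a starting point and tune them against your own jobs. The configuration keys themselves are standard Spark properties.

```python
import math

# Assumption: aim for roughly 128 MB of input per partition (rule of thumb).
TARGET_PARTITION_MB = 128


def suggest_spark_conf(input_size_gb, target_partition_mb=TARGET_PARTITION_MB):
    """Return a dict of Spark properties sized for an estimated input volume.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    if input_size_gb <= 0:
        raise ValueError("input_size_gb must be positive")
    if target_partition_mb <= 0:
        raise ValueError("target_partition_mb must be positive")

    # One partition per target-sized slice of the input, and at least one.
    partitions = max(1, math.ceil(input_size_gb * 1024 / target_partition_mb))

    return {
        # Parallelism of DataFrame/SQL shuffles (groupBy, joins).
        "spark.sql.shuffle.partitions": str(partitions),
        # Default parallelism of RDD shuffles (reduceByKey, combineByKey).
        "spark.default.parallelism": str(partitions),
        # Let Spark scale the number of executors with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation needs the external shuffle service
        # (or shuffle tracking on newer Spark versions).
        "spark.shuffle.service.enabled": "true",
    }


if __name__ == "__main__":
    # Print the settings as spark-submit flags for a ~100 GB input.
    for key, value in suggest_spark_conf(100).items():
        print(f"--conf {key}={value}")
```

You can pass the resulting pairs to spark-submit with --conf, or set them on SparkSession.builder with .config(key, value). Executor memory (spark.executor.memory) is left out on purpose, because the right value depends on how much of the data you actually cache and on your cluster's container limits.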