
Performance issue with Spark module



As the volume of data grows, our Spark job's performance is degrading. Can someone please suggest Spark performance tuning options, or any other way to improve performance?


Re: Performance issue with Spark module

1. Analyze your Spark code for ways to optimize joins and wide transformations such as groupBy, reduceByKey, and combineByKey.
2. Increase the parallelism (number of shuffle partitions) for groupBy operations.
3. Try using CombineFileInputFormat when reading many small files.
4. Compress and save your data in a columnar format (ORC or Parquet is best suited).
5. If you run a sequence of jobs, cache intermediate data in memory where possible.
6. Estimate the size of your data and, if possible, launch containers with enough RAM to hold it.
7. Dynamic allocation is best for most use cases.
8. Use coalesce instead of repartition when you only need to reduce the number of partitions, since coalesce avoids a full shuffle.
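As a starting point, points 2, 6, and 7 above can be expressed as spark-submit settings. This is a minimal sketch assuming a YARN cluster; the executor memory, core count, shuffle partition count, and the script name your_job.py are illustrative placeholders that you should size from your own data volume and cluster capacity, not recommended values.

```shell
# Illustrative spark-submit invocation (all values are placeholders):
# - spark.dynamicAllocation.enabled: tip 7 (dynamic allocation)
# - spark.shuffle.service.enabled: required for dynamic allocation on YARN
# - spark.sql.shuffle.partitions: tip 2 (more parallelism for groupBy shuffles)
# - --executor-memory / --executor-cores: tip 6 (size containers to your data)
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  your_job.py
```

After changing these settings, compare stage and task timings in the Spark UI before and after to confirm the change actually helped for your workload.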
