
Performance improvement techniques for Spark SQL DataFrames


Sorry for being somewhat abstract, but I would like to ask this question since many of you have probably come across a situation like mine.

We are building a machine-learning-based application. As part of it, we need to join several large files; the data is around 10 GB in total. We load the files as DataFrames and cache them as well.

However, we need to perform quite a few joins. Could you please suggest some optimization techniques you have come across to keep the application performing at speed?

Does persisting to disk with the Kryo serializer improve performance? Our workload is mainly join operations; the other operations are not that computation-heavy.
