Wrapping DataFrames in new classes / functions: does this preserve efficiency?

My current case is that I am wrapping a DataFrame in another class called Table to extend functionality.

I have a few wrapper methods that modify the encapsulated DataFrame and return a new Table object.

I also have a few methods that essentially perform a groupBy(...).agg(...) followed by an iterated join.

My question is: will Spark still be able to optimize, under the hood, the Spark SQL methods I call, or will adding a layer of abstraction such as a class or class methods prevent some or all optimization?

Re: Wrapping DataFrames in new classes / functions: does this preserve efficiency?

It depends on what you are doing. If you are encapsulating a function that takes a DataFrame and only calls the Spark SQL functions on it, then it will be fine.
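As a rough illustration of the "it will be fine" case, here is a minimal sketch of a thin wrapper. The Table class and its method names (where, summarize, join_all, demo) are hypothetical stand-ins for the class in the question, not part of Spark; only filter, groupBy, agg, join and explain are real DataFrame methods. Each wrapper method just forwards to the DataFrame and returns a new Table, so nothing executes until an action is called, and the optimizer should still see a single logical plan.

```python
from functools import reduce


class Table:
    """Hypothetical thin wrapper around a pyspark.sql.DataFrame."""

    def __init__(self, df):
        self.df = df  # the wrapped DataFrame (still lazy)

    def where(self, condition):
        # Forwards to DataFrame.filter; only extends the logical plan.
        return Table(self.df.filter(condition))

    def summarize(self, keys, *aggs):
        # groupBy(...).agg(...) with built-in aggregate expressions.
        return Table(self.df.groupBy(*keys).agg(*aggs))

    def join_all(self, others, on, how="inner"):
        # Iterated join: fold the other tables into this one, left to right.
        joined = reduce(
            lambda left, right: left.join(right.df, on=on, how=how),
            others,
            self.df,
        )
        return Table(joined)


def demo():
    # Assumes pyspark is installed; imported here so the class above
    # can be read and loaded without it.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").appName("table-demo").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["k", "v"])
    totals = Table(df).where(F.col("v") > 0).summarize(["k"], F.sum("v").alias("total"))
    # Prints the parsed, analyzed, optimized and physical plans.
    totals.df.explain(True)
    spark.stop()
```

Calling demo() and comparing its explain output with the plan for the same query written directly against the DataFrame is a simple way to check that the wrapper adds nothing to the plan.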

However, some patterns will hurt your performance. On an RDD, groupByKey() should be replaced in almost every case by reduceByKey(), or your performance will really suffer. A DataFrame groupBy().agg() with the built-in aggregate functions is a different story: it goes through the optimizer and has no reduceBy() counterpart to switch to. What really costs you in pyspark is a Python UDF: behind the scenes the row data has to be pickled, sent to a Python worker, and the results sent back, and the optimizer cannot look inside the UDF. If you have a DataFrame, use the DataFrame API to do your logic. Those calls execute in the JVM, so performance will be about the same as if the code had been written in Java or Scala.

We have removed all our UDFs, rewritten them with the built-in SQL functions, and seen huge performance gains.
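To make the UDF versus built-in difference concrete, here is a small sketch. The names shout, with_shout_udf and with_shout_builtin are hypothetical, and it assumes a string column; F.udf, F.upper, F.col and withColumn are real pyspark calls. Both functions are meant to produce the same column, but only the second stays entirely in the JVM.

```python
def shout(value):
    """Plain Python logic: upper-case a string, passing nulls through."""
    return None if value is None else value.upper()


def with_shout_udf(df, column):
    # Slower path: every value is serialized to a Python worker and back.
    # Imports are local so the helper above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    shout_udf = F.udf(shout, StringType())
    return df.withColumn(column + "_upper", shout_udf(F.col(column)))


def with_shout_builtin(df, column):
    # Faster path: F.upper is a built-in expression evaluated in the JVM,
    # so the optimizer can see it and no Python round trip is needed.
    from pyspark.sql import functions as F

    return df.withColumn(column + "_upper", F.upper(F.col(column)))
```

Running explain() on the result of each function should show a Python evaluation step in the UDF plan that is absent from the built-in one.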

Re: Wrapping DataFrames in new classes / functions: does this preserve efficiency?

I can't find reduceBy() in the Java/Scala DataFrame API, so I just use groupBy and agg. Is there no way around them? I know RDDs have reduceByKey...
