My use case is that I am wrapping a DataFrame in another class called Table to extend its functionality.
I have a few wrapper methods that modify the encapsulated DataFrame and return a new Table object.
I also have a few methods that essentially perform a groupBy(...).agg(...) followed by an iterated join.
My question is: will Spark still be able to optimize the Spark SQL methods I call under the hood, or will adding a layer of abstraction, such as a class or class methods, prevent some or all of that optimization?
It depends on what you are doing. If your wrapper methods just take a DataFrame and compose the built-in SQL functions, then it will be fine.
However, parts of what you are describing can hurt your performance. If you are performing a groupBy().agg() (which, in my experience, should be replaced in almost every case by reduceBy()), your performance will really suffer. What happens behind the scenes is that PySpark has to pickle your data and ship it to Python workers whenever you use a UDF. If you have a DataFrame, use the DataFrame API for your logic: that API actually executes in Java/Scala, so the performance will be the same as if the code were written in those languages.
We removed all of our UDFs, rewrote them as functions using the built-in SQL functions instead, and saw huge performance gains.
I can't find reduceBy() on Java/Scala DataFrames, so I just use groupBy and agg. Is there no way around them? I know there is reduceByKey in RDDs...