Support Questions

Why are DataFrames equally fast in all languages?

Why are Spark DataFrames equally fast in Scala and Python? The same is not the case with RDDs: RDDs created in Scala are faster than the ones created in Python.

1 ACCEPTED SOLUTION

Master Guru

The following has a good overview:

Essentially, RDDs are directly implemented code: whatever you write gets executed. DataFrames, on the other hand, get compiled into an execution plan and then executed by the same engine regardless of language. In Python there is essentially only a DataFrame API wrapping that engine, so you would expect the same performance unless you use big, heavy Python UDFs.
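
To make this concrete, here is a minimal PySpark sketch (a hypothetical example, assuming a local SparkSession) contrasting the two APIs: the RDD lambda is shipped to Python worker processes and executed as Python code, while the DataFrame filter is compiled into a plan that runs entirely on the JVM.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD API: the lambda below is pickled and executed row by row
# in Python worker processes.
rdd = spark.sparkContext.parallelize(range(1000000))
rdd_even = rdd.filter(lambda x: x % 2 == 0).count()

# DataFrame API: the filter expression is not Python code at all;
# it becomes part of an execution plan run by the JVM engine.
df = spark.range(1000000)
df_even = df.filter(col("id") % 2 == 0).count()

# Inspect the plan Catalyst produced for the DataFrame version.
df.filter(col("id") % 2 == 0).explain()
```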

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science....

Essentially, DataFrames have two advantages over RDDs: A) most people do not know how to write optimized Spark code, and B) the optimizer can do some tricks based on data characteristics (dataset size, etc.) that a user might not be aware of at write time.
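
One concrete example of such a trick (a hypothetical sketch): when Catalyst can tell that one side of a join is small, it can choose a broadcast hash join, an optimization that hand-written RDD code would have to implement manually.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Catalyst can estimate the size of each side of the join; the small
# table is broadcast to every executor instead of being shuffled.
small = spark.range(100).withColumnRenamed("id", "k")
large = spark.range(10000000).withColumnRenamed("id", "k")

# The physical plan typically shows a BroadcastHashJoin here.
large.join(small, "k").explain()
```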

https://0x0fff.com/spark-dataframes-are-faster-arent-they/

Edit: I was curious and dug a bit deeper, and I think the following is the best overview. Essentially, as said, as long as you use the basic DataFrame functions, performance is equal, because the DataFrame code (Python or Scala) gets translated into the same RDD code (Scala); but if you use heavy Python UDFs you will see performance differences again.

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

"This is still true if you want to use Dataframe’s User Defined Functions, you can write them in Java/Scala or Python and this will impact your computation performance – but if you manage to stay in a pure Dataframe computation – then nothing will get between you and the best computation performance you can possibly get."

2 REPLIES

Hi @ARUNKUMAR RAMASAMY, the best resource I've found describing this is here:

https://0x0fff.com/spark-dataframes-are-faster-arent-they/

I'd be interested to know if this answers your question or not!

Spark DataFrames use the Catalyst optimiser under the hood. The Spark code gets transformed into an abstract syntax tree, or logical plan, on which several optimisations are applied before code is generated from it. See the following paper for the full explanation: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
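
You can see these stages for yourself; the sketch below (a hypothetical example, assuming a local SparkSession) prints the parsed, analyzed, and optimized logical plans followed by the generated physical plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).filter(col("id") % 7 == 0)

# extended=True prints each stage of Catalyst's pipeline: parsed
# logical plan -> analyzed -> optimized -> physical plan.
df.explain(extended=True)
```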

The reason Spark DataFrames are fast in all languages is that, whether you use Python, Java, or Scala, the implementation used under the hood is the Scala implementation of the DataFrame and the Catalyst optimiser. Whichever language you use, the logical plans are passed to the same Catalyst optimiser.