Why does RDD appear to be faster than DataFrame on certain operations, such as filter?
Spark is best known for RDDs, where data can be held in memory and transformed as needed. When an RDD is cached, Spark can access it faster than data that must first be materialized from a DataFrame. The data is stored as partitioned chunks, which enables parallel I/O: whenever you read from an RDD, multiple tasks can hit the partitions in parallel to perform I/O, which can make some operations faster than the equivalent DataFrame operation.
Can you share an example where filter executes more slowly on a DataFrame than on an RDD? Which version of Spark are you on, and are you using Python or Scala?
The main disadvantage of RDDs is that they don't perform particularly well when Spark needs to distribute data within the cluster or write it to disk: it does so using Java serialization by default (although it is possible to use Kryo as a faster alternative in most cases). The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes (each serialized object contains the class structure as well as the values). There is also the overhead of garbage collection that results from creating and destroying individual objects.
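The "structure plus values" overhead described above can be sketched in plain Python using pickle as a stand-in for Java serialization (the `Row` class and sizes here are purely illustrative, not a Spark API):

```python
import pickle

# Toy row class standing in for a JVM object; this is an analogy,
# not how Spark represents rows internally.
class Row:
    def __init__(self, id, name):
        self.id = id
        self.name = name

rows = [Row(i, "user%d" % i) for i in range(1000)]

# Serializing each object individually repeats the structural metadata
# (module name, class name, attribute names) in every payload.
per_object = sum(len(pickle.dumps(r)) for r in rows)

# Serializing just the raw values once amortizes that overhead.
values_only = len(pickle.dumps([(r.id, r.name) for r in rows]))

print(per_object, values_only)
```

Running this shows the per-object encoding is substantially larger than the values-only encoding, which mirrors why shipping many small serialized objects between nodes is expensive.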
So in Spark 1.3 the DataFrame API was introduced, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass the data between nodes, which is much more efficient than Java serialization. There are also advantages when performing computations in a single process: Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection cost of constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data.
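As a rough sketch of that idea, here is a schema-aware binary layout in plain Python using the `struct` module. The two-column schema `(id: int32, score: float64)` is hypothetical, and this is only loosely analogous to Spark's Tungsten row format, but it shows how a filter can scan a flat buffer and read fields in place without building one object per row:

```python
import struct

# Hypothetical schema for illustration: (id: int32, score: float64).
# Knowing the schema lets us pack every row into a fixed 12-byte slot
# of one flat binary buffer.
row = struct.Struct("<id")

data = [(i, i * 0.5) for i in range(1000)]
buffer = b"".join(row.pack(i, s) for i, s in data)

# A "filter" can walk the buffer by fixed offsets and inspect fields
# directly, without materializing a Python object per row.
matches = [
    row.unpack_from(buffer, off)[0]
    for off in range(0, len(buffer), row.size)
    if row.unpack_from(buffer, off)[1] > 400.0
]

print(len(matches))
```

The real Tungsten format is considerably more sophisticated (variable-length fields, null bitmaps, off-heap allocation), but the principle of operating on packed binary data rather than object graphs is the same.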
Query plans are created and optimized by Spark's Catalyst optimizer. After an optimized execution plan has been prepared through several rewrite steps, the final execution still happens internally on RDDs, but that is completely hidden from the user.
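To make the "rewrite steps" concrete, here is a toy rule-based optimizer pass in plain Python (a sketch of the idea only, not Spark's actual implementation): plans are nested tuples, and one rule collapses two stacked filters into a single filter, so the data is scanned once instead of twice.

```python
# Toy logical plans as nested tuples:
#   ("filter", predicate, child)  or  ("scan", table_name)
# This mimics, very loosely, a Catalyst-style rewrite rule.

def combine_filters(plan):
    """Collapse Filter(p, Filter(q, child)) into Filter(p and q, child)."""
    if plan[0] == "filter":
        child = combine_filters(plan[2])
        if child[0] == "filter":
            p, q = plan[1], child[1]
            # Merge the two predicates into one pass over the data.
            return ("filter", lambda r: p(r) and q(r), child[2])
        return ("filter", plan[1], child)
    return plan

logical = ("filter", lambda r: r > 10,
           ("filter", lambda r: r % 2 == 0,
            ("scan", "numbers")))

optimized = combine_filters(logical)
print(optimized[0], optimized[2][0])
```

In real Spark you can see the analogous output with `df.explain()`, which prints the parsed, analyzed, optimized, and physical plans that Catalyst produced before they are compiled down to RDD operations.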