
What are the advantages of Dataframe in Apache Spark?


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database.
DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Similar to RDDs, DataFrames are evaluated lazily. That is to say, computation only happens when an action (e.g. display result, save output) is required.

Out of the box, DataFrame supports reading data from the most popular formats, including JSON files, Parquet files, and Hive tables. It can read from local file systems, distributed file systems (HDFS), cloud storage (S3), and external relational database systems via JDBC. In addition, through Spark SQL’s external data sources API, DataFrames can be extended to support any third-party data formats or sources. Existing third-party extensions already include Avro, CSV, ElasticSearch, and Cassandra.

In short, a DataFrame can be thought of as a table much like one in a traditional RDBMS. In Spark, we use DataFrames to run SQL queries over data that has been loaded into, or is available as, RDDs.
