Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

Pyspark - Spark SQL

Contributor

Hi all,

I am converting a long-running Hive SQL job into a Spark SQL-based solution, and I have two options:

1) Create a DataFrame for each of the Hive tables, register it as a temp table, replicate the SQL, and run it on Spark:

table1 = sqlContext.sql("select * from table1")

table1.registerTempTable("table1")

... similarly for all the other tables, then replicate the SQL and run it on Spark.

Pros: faster prototyping.

2) Use the DataFrame API in PySpark, e.g. df.distinct().select()... (a sketch of this style follows below).

Cons: relatively slower development time.
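For illustration, here is a minimal sketch of what option 2 can look like. The orders table and its columns (country, amount) are hypothetical, and sqlContext is assumed to be the Hive-enabled context from the pyspark shell, as in option 1:

from pyspark.sql import functions as F

# Read the Hive table directly as a DataFrame (hypothetical table name)
orders = sqlContext.table("orders")

# Equivalent of: SELECT country, SUM(amount) AS total FROM orders
#                WHERE amount > 0 GROUP BY country ORDER BY total DESC
result = (orders
          .filter(F.col("amount") > 0)
          .groupBy("country")
          .agg(F.sum("amount").alias("total"))
          .orderBy(F.col("total").desc()))

result.show()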

What are the pros and cons of one versus the other, and how should I choose?

Thanks,

Abhijeet Rajput

1 ACCEPTED SOLUTION

Guru

@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame, and Spark SQL. It will help you make an informed decision.

https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html

In summary:

You mainly need to analyze your use case (e.g., what type of queries you will be running, how big the data set is, etc.).

Depending on your use case, you can choose to go with either SQL or the DataFrame API.

For example: if your use case involves a lot of group-by and order-by style queries, you should go with Spark SQL instead of the DataFrame API, because Spark SQL executed faster than the DataFrame API for such use cases in that comparison. A sketch contrasting the two styles follows below.
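To make the comparison concrete, the sketch below runs the same hypothetical group-by/order-by query both ways. The sales table and its columns (region, revenue) are made up for illustration, and sqlContext is again the Hive-enabled context from the pyspark shell:

from pyspark.sql import functions as F

# Spark SQL: express the query as a string against the Hive table
by_sql = sqlContext.sql(
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC")

# DataFrame API: the same logic as chained method calls
by_api = (sqlContext.table("sales")
          .groupBy("region")
          .agg(F.sum("revenue").alias("total"))
          .orderBy(F.col("total").desc()))

# Both produce DataFrames; .explain() shows the physical plans, so you
# can check how the two actually differ on your Spark version
by_sql.explain()
by_api.explain()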
