Pyspark - Spark SQL

Hi All,

I am converting a long-running SQL query into a Hive/Spark SQL based solution, and I have two options:

1) Create a DataFrame for each Hive table, register it as a temporary table, and run the existing SQL on Spark:

table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")

... and so on for all the tables, then run the existing SQL on Spark.

Pros: faster prototyping.

2) Use the DataFrame API in PySpark, e.g. df.distinct().select()...

Cons: relatively slower development time.

What are the pros and cons of one versus the other, and how should I choose?

Thanks,

Abhijeet Rajput

1 ACCEPTED SOLUTION

@Abhijeet Rajput, I found an article that compares the performance of RDDs, DataFrames, and Spark SQL. It should help you make an informed decision.

https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html

In summary,

You mainly need to analyze your use case (for example, what types of queries you will run and how big the data set is).

Depending on your use case, you can go with either SQL or the DataFrame API.

For example, if your use case involves a lot of GROUP BY and ORDER BY style queries, you should go with Spark SQL instead of the DataFrame API, because the article found that Spark SQL executed faster than the DataFrame API for such queries.
