
Pyspark - Spark SQL

Freakabhi — Wed, 22 Mar 2017 08:19:13 GMT

Hi All,

I am converting a long-running SQL query into a Hive/Spark SQL based solution, and I have two options:

1) Create a DataFrame for each of the Hive tables, then replicate the SQL and run it on Spark:

table1 = sqlContext.sql("select * from table1")

table1.registerTempTable("table1")

.... and similarly for all the other tables, then replicate the SQL and run it on Spark.

Pros: faster prototyping.
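For illustration, option 1 might look roughly like the sketch below. The table and column names (`orders`, `customer_id`, `amount`) are made up, and `spark` stands for whatever session object is in use (a HiveContext or SQLContext on Spark 1.x, a SparkSession on 2.x and later). When Hive support is enabled, Hive tables can usually be queried by name directly, so the temp-table step is mainly needed for DataFrames that were built in code.

```python
# Sketch of option 1: keep the existing query as SQL text and hand it to Spark.
# The table and column names below are hypothetical placeholders.
QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
"""


def run_sql_version(spark):
    """Run the query through Spark SQL and return the resulting DataFrame.

    `spark` is assumed to expose .sql(), as HiveContext, SQLContext and
    SparkSession all do.
    """
    return spark.sql(QUERY)
```

The appeal of this route is that the original SQL can be pasted in with few changes, which is where the faster prototyping comes from.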

2) Use the DataFrame API from PySpark, e.g. df.distinct().select().....

Cons: relatively slower development time.
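As a rough comparison, an aggregation written with the DataFrame API could look like the sketch below. The names `orders`, `customer_id` and `amount` are hypothetical, not taken from the actual tables, and `spark` is again assumed to be a HiveContext, SQLContext or SparkSession.

```python
# Sketch of option 2: express the query with DataFrame methods instead of SQL.
# Table and column names are hypothetical placeholders.
def run_df_version(spark):
    """Total `amount` per `customer_id`, largest total first."""
    orders = spark.table("orders")  # load the Hive table as a DataFrame
    return (
        orders.groupBy("customer_id")
        .agg({"amount": "sum"})  # the result column is named "sum(amount)"
        .withColumnRenamed("sum(amount)", "total")
        .orderBy("total", ascending=False)
    )
```

Each step is an ordinary Python expression, so intermediate DataFrames can be reused or tested separately, at the cost of more typing than pasting the SQL.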

What are the pros and cons of one versus the other, and how should I choose?

thanks

Abhijeet Rajput

Re: Pyspark - Spark SQL

yvora — Thu, 23 Mar 2017 01:58:45 GMT

@Abhijeet Rajput, I found an article that compares the performance of RDDs, DataFrames, and Spark SQL. It should help you make an informed decision.

https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html

In summary,

You mainly need to analyze your use case (for example, what type of queries you will be running, how big the data set is, etc.).

Depending on your use case, you can choose either SQL or the DataFrame API.

For example, if your use case involves a lot of GROUP BY or ORDER BY style queries, you should go with Spark SQL instead of the DataFrame API, because the article found that Spark SQL executed faster than the DataFrame API for such queries.
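Results like this can depend on the Spark version and the data, so it may be worth measuring both variants on your own workload before deciding. Below is a minimal timing sketch, assuming each variant is wrapped in a zero-argument function that returns a DataFrame. The helper name `time_action` and the `repeats` parameter are made up for this example.

```python
import time


def time_action(label, build_df, repeats=3):
    """Time build_df().count() a few times and return the best run in seconds.

    `build_df` is assumed to be a zero-argument callable returning a DataFrame.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        build_df().count()  # count() is an action, so it forces execution
        best = min(best, time.perf_counter() - start)
    print("%s: %.3fs" % (label, best))
    return best


# Hypothetical usage from a PySpark shell, where `my_query` is your SQL text
# and `my_df_version` builds the same result with the DataFrame API:
#   time_action("sql", lambda: sqlContext.sql(my_query))
#   time_action("dataframe", my_df_version)
```

Calling `.explain()` on each resulting DataFrame prints its physical plan, which is another way to see whether the two variants are actually executed differently.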