Created 03-22-2017 01:19 AM
Hi All,
I am converting a long-running Hive SQL workload into a Spark SQL based solution, and I have two options:
1) Create a DataFrame for each Hive table, register it as a temp table, and run the replicated SQL on Spark:
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
... and similarly for all the other tables, then run the replicated SQL on Spark.
pros: faster prototyping
2) Use the DataFrame API in PySpark, like df.distinct().select()...
relatively slower development time (see the sketch after this list)
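For concreteness, here is a minimal sketch of the two styles side by side. It assumes a SQLContext bound to Hive, as in the snippet above, and the table and column names (table1, table2, id, name) are made up for illustration:

# Option 1: register temp tables, then run the replicated SQL as-is
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
table2 = sqlContext.sql("select * from table2")
table2.registerTempTable("table2")
result_sql = sqlContext.sql(
    "select t1.id, t2.name from table1 t1 join table2 t2 on t1.id = t2.id")

# Option 2: express the same join with the DataFrame API
result_df = (table1.join(table2, table1.id == table2.id)
                   .select(table1.id, table2.name))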
What are the pros and cons of one versus the other, and how should I choose?
thanks
Abhijeet Rajput
Created 03-22-2017 06:58 PM
@Abhijeet Rajput, I found an article that compares the performance of RDDs, DataFrames, and Spark SQL. It will help you make an informed decision.
https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html
In summary,
You mainly need to analyze your use case (what types of queries you will be running, how big the data set is, etc.).
Depending on your use case, you can choose to go with either Spark SQL or the DataFrame API.
For example: if your use case involves a lot of groupBy/orderBy style queries, you should go with Spark SQL instead of the DataFrame API (because Spark SQL executes faster than the DataFrame API for such use cases).
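To make that comparison concrete, here is the same aggregation written both ways. This is only a sketch; the table and column names (employees, dept, salary) are hypothetical:

from pyspark.sql import functions as F

emp = sqlContext.table("employees")  # hypothetical Hive table
emp.registerTempTable("employees")

# Spark SQL version
top_sql = sqlContext.sql(
    "select dept, avg(salary) as avg_salary "
    "from employees group by dept order by avg_salary desc")

# DataFrame API version of the same query
top_df = (emp.groupBy("dept")
             .agg(F.avg("salary").alias("avg_salary"))
             .orderBy(F.desc("avg_salary")))

You can run .explain() on either result against your own data to see the physical plan Spark actually produces before committing to one style.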