Pyspark - Spark SQL
Labels: Apache Spark
Created 03-22-2017 01:19 AM
Hi All,
I am converting a long-running SQL workload into a Hive/Spark SQL based solution, and I have two options:
1) Create a DataFrame for each Hive table, replicate the SQL, and run it on Spark:
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
... and similarly for all the other tables, then replicate the SQL and run it on Spark.
Pros: faster prototyping.
2) Use the DataFrame API in PySpark, e.g. df.distinct().select()... (see the sketch below).
Cons: relatively slower development time.
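To illustrate what I mean by option 2, it would look roughly like this (a minimal sketch; the "orders" table and its column names are made up for illustration):

from pyspark.sql import functions as F

orders = sqlContext.table("orders")              # load the Hive table as a DataFrame
result = (orders
          .filter(F.col("status") == "OPEN")     # WHERE status = 'OPEN'
          .groupBy("customer_id")                # GROUP BY customer_id
          .agg(F.sum("amount").alias("total"))   # SUM(amount) AS total
          .orderBy(F.col("total").desc()))       # ORDER BY total DESC
result.show()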
What are the pros and cons of one versus the other, and how should I choose?
Thanks,
Abhijeet Rajput
Created 03-22-2017 06:58 PM
@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame, and Spark SQL. It should help you make an informed decision:
https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html
In summary:
You mainly need to analyze your use case (what type of queries you will be running, how big the data set is, etc.).
Depending on your use case, you can choose to go with either Spark SQL or the DataFrame API.
For example, if your use case involves a lot of group-by and order-by style queries, you should go with Spark SQL instead of the DataFrame API (because Spark SQL executed faster than the DataFrame API for that kind of workload in the article's benchmarks).
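To make that concrete, here is the same aggregation written both ways (a minimal sketch; the "sales" table and its column names are made up for illustration):

from pyspark.sql import functions as F

# Spark SQL version: register the Hive table and run the query as a string.
sales = sqlContext.sql("select * from sales")
sales.registerTempTable("sales")
sql_result = sqlContext.sql(
    "select region, count(*) as cnt from sales group by region order by cnt desc")

# DataFrame API version of the same query.
df_result = (sales.groupBy("region")
                  .agg(F.count("*").alias("cnt"))
                  .orderBy(F.col("cnt").desc()))

Calling .explain() on both results lets you compare the physical plans generated for your own tables before committing to one style.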
