Support Questions

Freakabhi · ‎03-22-2017

Hi All,

I am converting a long-running SQL job into a Hive/Spark SQL based solution, and I have two options:

1) Create a DataFrame for each Hive table, register it as a temp table, and run the existing SQL on Spark:

table1 = sqlContext.sql("select * from table1")

table1.registerTempTable("table1")

.... similarly for all the tables, then replicate the SQL and run it on Spark.

Pros: faster prototyping.

2) Use the DataFrame API in PySpark, e.g. df.distinct().select().....

Relatively slower development time.

What are the pros and cons of one versus the other, and how do I choose?

thanks

Abhijeet Rajput

yvora · ‎03-22-2017

@Abhijeet Rajput, I found an article that compares the performance of RDDs, DataFrames, and Spark SQL. It should help you make an informed decision.

https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html

In summary,

You mainly need to analyze your use case (for example, what type of queries you will be running and how big the data set is).

Depending on your use case, you can go with either SQL or the DataFrame API.

For example, if your use case involves a lot of GROUP BY and ORDER BY style queries, you should go with Spark SQL instead of the DataFrame API (because, according to the article, Spark SQL executes faster than the DataFrame API for such use cases).
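One way to check this on your own data is to write the same query both ways and compare the plans Spark produces. The sketch below is only an illustration: the `dept` column and the helper function names are hypothetical, and `spark` is assumed to be a SparkSession with Hive support (on Spark 1.x, a HiveContext such as the `sqlContext` in the question exposes the same `.sql()` and `.table()` methods).

```python
# Hypothetical names: table1 is assumed to have a "dept" column.
# `spark` is assumed to be a SparkSession (or HiveContext) that can see table1.

def dept_counts_sql(spark):
    """Aggregation written as a SQL string."""
    return spark.sql(
        "SELECT dept, COUNT(*) AS n FROM table1 GROUP BY dept ORDER BY dept"
    )


def dept_counts_df(spark):
    """The same aggregation written with the DataFrame API."""
    return (
        spark.table("table1")
        .groupBy("dept")
        .count()                          # adds a column named "count"
        .withColumnRenamed("count", "n")  # match the SQL alias
        .orderBy("dept")
    )


def compare_plans(spark):
    """Print the physical plan of each version so they can be compared."""
    dept_counts_sql(spark).explain()
    dept_counts_df(spark).explain()
```

Calling `compare_plans(spark)` from the pyspark shell prints both plans. Both styles are planned by the same optimizer, so if the plans match for your queries, the choice can come down to which form is easier to write and maintain; if they differ, timing both on a realistic data size is the safer guide.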


Cloudera Community
