Hi All,
I am converting a long-running SQL workload into a Hive/Spark SQL based solution, and I see two options:
1) create a DataFrame for each of the Hive tables, register them as temp tables, and run the original SQL on Spark
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
.... similarly for all the tables, then replicate the SQL and run it on Spark
pros: faster prototyping
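To make option 1 concrete, here is a minimal sketch of what I mean (the orders/customers tables and the query are hypothetical placeholders, assuming Spark 1.x with a HiveContext):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="sql-migration")
sqlContext = HiveContext(sc)

# register each Hive table under the same name so the legacy SQL runs unchanged
for name in ["orders", "customers"]:
    sqlContext.sql("select * from " + name).registerTempTable(name)

# paste the original SQL verbatim
result = sqlContext.sql("""
    select c.id, count(*) as order_count
    from orders o join customers c on o.customer_id = c.id
    group by c.id
""")
result.show()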
2) use the DataFrame API in PySpark, e.g. df.distinct().select()...
cons: relatively slower development time
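For comparison, the same hypothetical query expressed through the DataFrame API (again just a sketch, using the same assumed tables):

orders = sqlContext.table("orders")
customers = sqlContext.table("customers")

# equivalent of the SQL join + group by, built from composable operators
result = (orders
          .join(customers, orders.customer_id == customers.id)
          .groupBy(customers.id)
          .count()
          .withColumnRenamed("count", "order_count"))
result.show()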
What are the pros and cons of one versus the other, and how should I choose?
thanks
Abhijeet Rajput