I would like suggestions on the correct way to convert very large SQL queries (roughly 1000 lines, joining 10+ tables with complicated transforms) into a PySpark program.
Pointers to relevant examples of converting large SQL would also help.
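To show what I mean, here is a toy sketch of the staged approach I am considering, where each logical block of the big query becomes a named DataFrame or temp view so the pieces can be tested independently. All table names, column names, and sample rows here are made up for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("staged-conversion").getOrCreate()

    # Hypothetical toy tables standing in for the 10+ real ones.
    orders = spark.createDataFrame(
        [(1, 100, 20.0), (2, 100, 35.0), (3, 200, 10.0)],
        ["order_id", "customer_id", "amount"],
    )
    customers = spark.createDataFrame(
        [(100, "EU"), (200, "US")],
        ["customer_id", "region"],
    )

    # Stage 1: one logical block of the big query becomes one named DataFrame.
    order_totals = (
        orders.groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount"))
    )

    # Stage 2: join the staged result to the next table, as the SQL would.
    enriched = order_totals.join(customers, "customer_id", "left")

    # Each stage can also be exposed as a temp view and checked with plain SQL.
    enriched.createOrReplaceTempView("enriched_orders")
    spark.sql(
        "select region, sum(total_amount) from enriched_orders group by region"
    ).show()

Is breaking the query into named stages like this the recommended approach, or is there a better pattern for queries of this size?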
I also have a question about writing Spark SQL programs: is there a performance difference between
1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table) t")
2) the PySpark API: df.select("col1", "col2").distinct().count()
(A runnable sketch of both follows.)
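For concreteness, here is a minimal runnable sketch of the two versions. The sample rows and the view name table1 are made up, and I use SparkSession as the entry point in place of SQLContext (it supersedes it in newer Spark versions). Comparing the plans with explain() is how I would check whether they differ:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

    # Toy data standing in for the real table.
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")],
        ["col1", "col2"],
    )
    df.createOrReplaceTempView("table1")

    # 1) SQL string form.
    sql_count = spark.sql(
        "select count(*) from (select distinct col1, col2 from table1) t"
    ).collect()[0][0]

    # 2) DataFrame API form.
    api_count = df.select("col1", "col2").distinct().count()

    print(sql_count, api_count)  # both yield 2

    # Compare the plans the two forms produce.
    spark.sql("select distinct col1, col2 from table1").explain()
    df.select("col1", "col2").distinct().explain()

My understanding is that both forms go through the Catalyst optimizer and should produce the same plan, but I would like confirmation.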
I come from a SQL background, and we are converting existing logic to Hadoop, so SQL is handy for me.