
PySpark SQL - Large Queries



I would like to get suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) into a PySpark program.


Also, are there any relevant examples for large SQLs?


I also have a question regarding writing a Spark SQL program: is there a performance difference between


1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table) t")

2) using the PySpark API:"col1", "col2").distinct().count()


I am from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.


Simplifying the SQL statement by denormalization (using several temp/intermediate tables) is a common way of tuning extremely large queries. Regarding your second question: there shouldn't be any difference, the way you wrote the query. But using the API gives you flexibility, such as taking advantage of caching.

I don't know offhand whether there is a difference. I recommend running both and then examining the resulting stages and the ordering of the actions.