
PySpark SQL - Large Queries

New Contributor

All,

I would like suggestions on the correct way to convert very large SQL queries (around 1000 lines, joining 10+ tables, with complicated transforms) into a PySpark program.

 

Also, are there any relevant examples of converting large SQL statements?

 

I also have a question about writing a Spark SQL program: is there a performance difference between writing

 

1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table) t")

2) using the PySpark API: df.select("col1", "col2").distinct().count()
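For concreteness, here is a minimal sketch of the two forms side by side (assuming a SparkSession named spark and a registered table/view called table with columns col1 and col2; SQLContext.sql behaves the same way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-count-comparison").getOrCreate()

# 1) SQL string form: distinct on two columns inside a subquery, then count
sql_count = spark.sql(
    "select count(*) as cnt from (select distinct col1, col2 from table) t"
).collect()[0]["cnt"]

# 2) DataFrame API form: the same logical operation expressed with methods
api_count = spark.table("table").select("col1", "col2").distinct().count()

# Both forms go through the same Catalyst optimizer, so the physical plans
# are typically identical; .explain() on each DataFrame confirms this.
print(sql_count, api_count)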

 

I come from a SQL background, and we are converting existing logic to Hadoop, so SQL is what I am most comfortable with.

3 Replies

Re: PySpark SQL - Large Queries

Explorer
Simplifying the SQL statement by denormalization (using several temp/intermediate tables) is a common way of tuning extremely large queries. Regarding your second question: there shouldn't be any difference the way you wrote the query, but using the API gives you flexibility, such as taking advantage of caching.
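To illustrate the staging approach, here is a minimal sketch; the table names orders and customers, the column names, and the SparkSession spark are assumptions for the example, not from the original post:

from pyspark.sql import functions as F

# Stage 1: pre-filter one large source table and cache the intermediate result,
# since it is reused by the downstream steps
recent_orders = (
    spark.table("orders")
    .where(F.col("order_date") >= "2019-01-01")
    .cache()
)
recent_orders.createOrReplaceTempView("recent_orders")

# Stage 2: the remaining joins/aggregations run against the smaller intermediate,
# either through SQL or the DataFrame API
summary = spark.sql("""
    select c.region, sum(o.amount) as total_amount
    from recent_orders o
    join customers c on o.customer_id = c.customer_id
    group by c.region
""")
summary.show()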

Re: PySpark SQL - Large Queries

Champion
I don't know offhand if there is a difference. I recommend running both and then examining the stages and the ordering of the actions in each.
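For example, .explain() prints the query plans without running the job, and the Spark UI shows the actual stages once an action runs. A minimal sketch, again assuming a SparkSession spark and a view table with columns col1 and col2:

sql_df = spark.sql(
    "select count(*) as cnt from (select distinct col1, col2 from table) t"
)
api_df = spark.table("table").select("col1", "col2").distinct()

# extended=True prints the parsed, analyzed, optimized, and physical plans
sql_df.explain(extended=True)
api_df.explain(extended=True)

# The stages and shuffles themselves appear in the Spark UI (Jobs / SQL tabs)
# once an action such as .count() or .collect() is executed.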

Re: PySpark SQL - Large Queries

Champion