
Pyspark SQL - Large Queries

Contributor

All,

I would like suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) into a PySpark program.

I also have a question about writing a Spark SQL program: is there a performance difference between

1) sqlContext.sql("select count(*) from (select distinct col1, col2 from table) t")
2) the PySpark API: df.select("col1", "col2").distinct().count()

I come from a SQL background and we are converting existing logic to Hadoop, so SQL is handy for me.

1 ACCEPTED SOLUTION

Super Collaborator

Hi @Abhijeet Rajput,

Regarding handling the huge SQL: Spark uses lazy evaluation, which means you can split your code into multiple blocks and write them as multiple DataFrames. These are only evaluated at the end, when Spark generates an optimal execution plan that accommodates the whole operation.

Example:

subquery1 = sqlContext.sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
subquery2 = sqlContext.sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
... and so on.
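As a side note (this assumes Spark 2.0 or later), registerTempTable is deprecated in favor of createOrReplaceTempView, which fits this pattern the same way:

subquery1.createOrReplaceTempView("res1")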

On the other question: there is no difference between using the DataFrame API and SQL, as the same execution plan is generated for both. You can validate this from the DAG in the Spark UI while the job is executing.
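You can also compare the plans directly with explain(). A minimal sketch, assuming a DataFrame df with columns col1 and col2 (df and my_table are placeholder names, not from the original post):

# register the DataFrame as a temp table so it can be queried with SQL
df.registerTempTable("my_table")

# the SQL form and the DataFrame API form of the same distinct
sql_df = sqlContext.sql("select distinct col1, col2 from my_table")
api_df = df.select("col1", "col2").distinct()

# print the physical plan for each; both should show the same
# aggregate/exchange operators (count() is just an action on top of this)
sql_df.explain()
api_df.explain()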

