
Pyspark SQL - Large Queries

Contributor

All,

I would like suggestions on the correct way to convert very large queries (around 1,000 lines) that join 10+ tables with complicated transforms into a PySpark program.

I also have a question about writing Spark SQL programs: is there a performance difference between

1) sqlContext.sql("select count(*) from (select distinct col1, col2 from table) t")
2) the PySpark API: df.select("col1", "col2").distinct().count()

I come from a SQL background and we are converting existing logic to Hadoop, so SQL is handy for me.
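For reference, a runnable sketch of the two variants being compared (a minimal sketch; the SparkSession setup is added here for completeness, it assumes col1/col2 exist in a registered table or view named "table", and on Spark 1.x the same calls would be made on a SQLContext instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-count-comparison").getOrCreate()
df = spark.table("table")  # assumes "table" is a registered table or temp view

# Variant 1: SQL string (the subquery alias "t" is required on some Spark versions)
sql_count = spark.sql(
    "select count(*) as cnt from (select distinct col1, col2 from table) t"
).collect()[0]["cnt"]

# Variant 2: DataFrame API
api_count = df.select("col1", "col2").distinct().count()

print(sql_count, api_count)  # both return the same value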

1 ACCEPTED SOLUTION

Super Collaborator

Hi @Abhijeet Rajput,

Regarding handling the huge SQL: Spark uses lazy evaluation, which means you can split your code into multiple blocks and write it using multiple DataFrames.

Everything is evaluated only at the end, and Spark generates an optimal execution plan that accommodates the whole operation.

Example:

var subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
var subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
// ...and so on
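A rough PySpark equivalent of the same pattern (a sketch only; the table names and join conditions are placeholders carried over from the example above, and an existing SparkSession named spark is assumed):

# Build intermediate DataFrames; nothing executes yet because of lazy evaluation
subquery1 = spark.sql(
    "select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2"
)
subquery1.createOrReplaceTempView("res1")  # registerTempTable() on Spark 1.x

subquery2 = spark.sql(
    "select c1, c2, c3 from res1 join tbl3 on condition4 and condition5"
)
# ...and so on; Spark optimizes the whole chain when an action (count, write, ...) runs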

On the other question: there is no performance difference between using the DataFrame API and SQL, as the same execution plan is generated for both. You can validate this from the DAG in the Spark UI while the job is executing.
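One way to verify that claim, besides the Spark UI, is to print the physical plans with explain(); a minimal sketch, assuming a DataFrame df whose data is also available as a temp view named tbl:

df.createOrReplaceTempView("tbl")

# SQL formulation
spark.sql("select distinct col1, col2 from tbl").explain()

# Equivalent DataFrame API formulation
df.select("col1", "col2").distinct().explain()

# Both calls print essentially the same physical plan, because Catalyst
# runs SQL and DataFrame code through the same optimizer.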
