
Spark SQL Internally

Rising Star

Hi,

I'm executing TPC queries over Hive tables using Spark SQL as below:

// in spark-shell, sc (the SparkContext) is already defined
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val query = hiveContext.sql("SELECT ...")
query.show()

I have learned how to configure and use Spark SQL up to this point. But now I would like to learn how Spark SQL works internally when it executes these queries over Hive tables: things like execution plans, logical and physical plans, and optimization, so I can better understand what Spark SQL uses to decide which execution plan is best.

I'm trying to find information about this, but nothing concrete. Can someone give an overview so I can understand the basics and then go looking for more detailed material, or do you know of any articles that explain this? Also, what is the command to see the logical and physical plans that Spark SQL uses when executing the queries?


6 REPLIES

Super Guru

I found a nice article by Databricks that explains query execution, optimization, and the logical and physical plans. I hope it will help you.

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Rising Star

Thanks for your answer. But does Spark SQL always use that Catalyst component? Is it part of Spark SQL? Does every query we execute go through it? And do you know how to show the logical and physical plans of the queries?


First of all, you can show the execution plan with the EXPLAIN syntax:

spark-sql> EXPLAIN SELECT * FROM mytable WHERE key = 1;
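
You can also run EXPLAIN EXTENDED to print the logical plans in addition to the physical plan. From the Scala API, the equivalent is DataFrame.explain; a minimal sketch, assuming a Hive table named mytable exists in your metastore:

// hiveContext as created in the question; "mytable" is a hypothetical table
val query = hiveContext.sql("SELECT * FROM mytable WHERE key = 1")
query.explain()      // prints the physical plan only
query.explain(true)  // extended: parsed, analyzed, and optimized logical plans, then the physical plan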

Yes, Spark SQL always uses the Catalyst optimizer, and DataFrame operations now go through it as well. This is shown in the diagram, where the SQL query (the parser's AST output) and DataFrames both feed into the Analysis phase of the optimizer.
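
If you want to inspect each Catalyst phase separately, a DataFrame also exposes its QueryExecution (a developer API, so the exact fields may vary by version); a minimal sketch against the same hypothetical mytable:

val df = hiveContext.sql("SELECT * FROM mytable WHERE key = 1")
println(df.queryExecution.logical)       // parsed logical plan
println(df.queryExecution.analyzed)      // plan after the Analysis phase
println(df.queryExecution.optimizedPlan) // plan after Catalyst's logical optimization
println(df.queryExecution.executedPlan)  // physical plan chosen for execution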

Also, be aware that there are two types of contexts: SQLContext and HiveContext. HiveContext provides a superset of the functionality of the basic SQLContext; the additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
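
For example, in spark-shell (where sc is already defined) you can create either one:

// basic SQLContext: Catalyst and DataFrames, but no Hive metastore access
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// HiveContext: a superset of SQLContext, adding the HiveQL parser,
// Hive UDFs, and access to tables in the Hive metastore
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)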

Rising Star

Thanks for your answer, now I can see the plans. What about the diagram that appears in the Spark user interface for each job, the DAG visualization: is it the logical plan, the physical plan, or something else? And which diagram were you referring to in your first sentence?


This is the image showing the phases of the optimizer (from the @Rajkumar Singh link above):

[Image: Catalyst optimizer phases (4026-spark-catalystoptimizer.png)]

Rising Star

Thanks for your help. And do you know what the DAG visualization of the jobs that run after we execute a query actually represents? Does that visualization show the physical plan or the logical plan?