
spark sql interaction with hive doubts



New Contributor

Hi, I'm studying how Spark interacts with Hive to execute queries over Hive tables with Spark SQL using HiveContext. But I'm having some doubts about understanding the logic.

From the Spark documentation, the basic code for this is the following:

// sc is an existing SparkContext.
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
var query = hiveContext.sql("select * from customers")
query.collect()

I have three main doubts. I read that Spark works with RDDs, and that Spark can then apply actions or transformations to those RDDs.

1) It seems that we can create an RDD by loading an external dataset, so where in the code above is the RDD created? Is it here, in "query = hiveContext.sql("select * from customers")"? Is var query the RDD?

2) And then, after the RDD is created, we can do transformations and actions. But in this case of executing queries over Hive tables, we just do actions, right? There is no need for transformations, right? And the action here is collect(), right?

3) And third, I also read that Spark computes RDDs in a lazy way to save storage space. In this use case of executing queries over Hive tables with the above code, where or how does this lazy evaluation mechanism happen, so that Spark can save storage space?

Can you give me some help to understand this better?

1 ACCEPTED SOLUTION


Re: spark sql interaction with hive doubts

Expert Contributor

1. query is going to be a DataFrame (an RDD with a schema attached to it).

2. Once the DataFrame is created, you can apply RDD transformations like map, or you can apply DataFrame operations to it, like query.select("name") if name is a column in the original table. When you create a DataFrame from a Hive table, Spark uses the metastore attached to Hive to get information about the underlying files, like location and format. Spark doesn't actually use Hive for any processing. The SQL query passed in is processed (through an optimizer) and turned into Spark code. When working with DataFrames, you can transform the data if you'd like; a DataFrame is just like any other RDD, except that it has a schema attached to it. The action in your case is indeed collect(), though show() is the DataFrame action which will give you a much nicer table to look at.
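To make the distinction concrete, here is a minimal sketch; the table customers and its string column name are just illustrative assumptions, not something from your cluster:

```scala
// sc is an existing SparkContext; assumes a Hive table `customers`
// with a string column `name` registered in the metastore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.sql("SELECT * FROM customers")

// DataFrame operation (lazy): project one column.
val names = df.select("name")

// RDD transformation (lazy): drop down to the underlying RDD[Row].
val upperNames = df.rdd.map(row => row.getAs[String]("name").toUpperCase)

// Actions: only these trigger reading the table and computing.
names.show()    // prints a formatted table of names
df.collect()    // returns an Array[Row] to the driver
```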

3. Spark SQL + Hive still has the concept of lazy evaluation. Lazy evaluation just means that Spark keeps track of all the transformations you did, and only reads/processes data through those transformations when you call an action.
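As a small illustration of point 3 (the age column and filter condition here are hypothetical):

```scala
// Nothing is read from Hive yet: both .sql and .filter are lazy;
// Spark only records the plan of transformations.
val adults = hiveContext.sql("SELECT * FROM customers").filter("age > 18")

// This action is what actually triggers scanning the table and
// running the recorded transformations.
val rows = adults.collect()
```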


Re: spark sql interaction with hive doubts

New Contributor

Thanks for your answer, it really helped me understand the logic better. I just have one more doubt about your second answer. So the actions are collect or show in this case. But regarding transformations, is select a transformation? And if the query is not only a "select * from customers" but has operations like group by, filter, or join, will those operations be transformations that Spark applies to the DataFrame during the query execution?

Re: spark sql interaction with hive doubts

Expert Contributor

Good questions. The ".sql" call before the query is the transformation in this case; "select * from customers" is telling the .sql transformation what to do. You can definitely include more advanced queries in the .sql transformation. Spark will parse the SQL query and convert it into Spark code, so all data will still be processed by Spark.
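For instance, a more involved query could look like the sketch below (the country and active columns are assumptions for illustration); the WHERE and GROUP BY clauses are parsed by Spark's optimizer and executed as ordinary Spark transformations:

```scala
// Lazy: building the query only records the plan.
val summary = hiveContext.sql(
  """SELECT country, COUNT(*) AS total
    |FROM customers
    |WHERE active = true
    |GROUP BY country""".stripMargin)

// Action: triggers the whole scan + filter + aggregation pipeline.
summary.show()
```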

Re: spark sql interaction with hive doubts

New Contributor

Thank you very much, this really helped me understand it better.