
Can I use SparkSQL to do complex joins and sorting?

New Contributor

We have several tables in Hive and we currently process the data with HQL, including some complex joins between multiple tables on multiple conditions. We are now planning to migrate to Spark 2.0.2. Can I use Spark SQL and run the same HQL query, or do I need to load the data into separate DataFrames first and perform the joins and other operations on the DataFrames instead of using HQL? Which is the better approach?

3 Replies


Generically speaking, yes, I'd just run the query built on your Hive tables, since Spark SQL is going to "figure out" what it needs to do in its optimizer before doing any work anyway. If the performance is within your SLA, I'd go with that, and you can always use it as a baseline for comparison if/when you try other approaches in your code. Happy Hadooping (eh hem... Sparking!) and good luck!
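If it helps, here's a minimal sketch of what that can look like in Spark 2.x (Scala). The table and column names are made up for illustration; the key points are enableHiveSupport() so spark.sql() can see the tables in your existing Hive metastore, and that the HQL string itself is passed through unchanged.

```scala
import org.apache.spark.sql.SparkSession

object HqlOnSpark {
  def main(args: Array[String]): Unit = {
    // Hive support lets spark.sql() resolve tables registered in the Hive metastore
    val spark = SparkSession.builder()
      .appName("hql-on-spark")
      .enableHiveSupport()
      .getOrCreate()

    // The same multi-table HiveQL join/sort can be submitted as-is;
    // Catalyst plans the whole query before any work actually runs.
    val result = spark.sql(
      """SELECT o.order_id, c.name, SUM(o.amount) AS total
        |FROM   orders o
        |JOIN   customers c ON o.customer_id = c.id
        |WHERE  o.order_date >= '2017-01-01'
        |GROUP BY o.order_id, c.name
        |ORDER BY total DESC""".stripMargin)

    result.show(20)
    spark.stop()
  }
}
```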

New Contributor

Thanks for answering! However, I have one more question. In the existing HiveQL we create some temporary Hive work tables to store intermediate data and drop them at the end. In Spark SQL, instead of those temporary tables, I can use createOrReplaceTempView("XXXX") to create a temporary in-memory view. As my data grows, what happens if this temp view can't fit in memory? Will my job fail? What do I need to do to handle these kinds of scenarios? Appreciate your reply!


Unfortunately, it is a bit more complicated than that. In general, Spark is lazily executed, so depending on what you do, even the "temp view" tables/DataFrame(Set)s may not stay around from DAG to DAG. There is an explicit cache method you can use on a DataFrame(Set), but even then you may be trying to cache something that simply won't fit in memory. No worries: Spark assumes that your DF(S)/RDD collections won't fit and inherently handles this.
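As a rough sketch of what that looks like in practice (hypothetical table names, and only one of several reasonable storage levels): persisting with MEMORY_AND_DISK lets partitions that don't fit in memory spill to local disk rather than failing the job, and the temp view just gives the cached result a name for later SQL, much like the old Hive work tables did. Treat this as illustrative, not a drop-in replacement for your job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Intermediate result that used to live in a temporary Hive work table
val intermediate = spark.sql(
  "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

// MEMORY_AND_DISK spills partitions that don't fit in memory to local disk
// instead of failing the job
intermediate.persist(StorageLevel.MEMORY_AND_DISK)

// Register it so later SQL can refer to it like the old work table did
intermediate.createOrReplaceTempView("order_totals")

spark.sql("SELECT * FROM order_totals WHERE total > 1000").show()

// Release the cached data once you're done (the equivalent of dropping the old work table)
intermediate.unpersist()
```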

I'm NOT trying to sell you anything, but some deeper learning could probably help you. I'm a trainer here at Hortonworks (and again, not really trying to sell you something, just pointing to a resource/opportunity), and we spend several days building up this knowledge in our https://hortonworks.com/services/training/class/hdp-developer-enterprise-spark/ class. Again, apologies for sounding like a salesperson, but my general thought is that there's still a bit more for you to learn about Spark internals, and that might take a more interactive way of building up that knowledge.