Hive on Tez or Hive query using Spark SQL

Expert Contributor

Hi,

Can you please let me know which one is faster: Hive on Tez, or accessing Hive using Spark SQL?

Thanks,

Chandra

1 ACCEPTED SOLUTION

Super Guru

@chandramouli muthukumaran

Just to clarify, SparkSQL does not access or use the Hive engine. It only consumes the metadata of the Hive data structures and runs the query on its own engine.

Assuming that both can execute the query (SparkSQL is quite limited functionally compared with Hive) and the query needs to churn through 40 TB of data, I would say Hive on Tez is likely your best choice. That is also driven by cost: a Spark cluster needs RAM in addition to what Hive requires, and I assume you will still have some cases where running Hive is needed. I have noticed that when the amount of data is less than 1 TB, SparkSQL outperforms Hive on Tez.

Anyhow, be aware that with HDP 2.5, LLAP is in Tech Preview and will soon be GA. If you were asking about Hive on LLAP vs. SparkSQL, I would say without hesitation: for most queries, Hive on LLAP. Again, for some sophisticated queries with a limited amount of data and limited functionality, SparkSQL may be the winner, but in the big picture it is too expensive to maintain both approaches, and I would still choose Hive on Tez and LLAP over SparkSQL for most cases that deal with big data. Otherwise, 1 TB does not need Hadoop for fast queries.

Read more about Hive on LLAP here:

http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/

Give LLAP a shot before deciding to use SparkSQL, especially if you already have the queries written in HiveQL.
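To make the rule of thumb in this answer concrete, here is a minimal Python sketch. It is only an illustration of the reasoning above, not a benchmark or an official sizing guide. The function name `suggest_engine`, its parameters, and the returned labels are hypothetical, and treating 1 TB as a hard cut-off is an assumption: the answer only says that SparkSQL tends to win below 1 TB and that Hive on Tez is the likely choice at 40 TB.

```python
TB = 1024 ** 4  # bytes in one tebibyte


def suggest_engine(data_bytes: int,
                   llap_available: bool = False,
                   needs_hive_only_features: bool = False) -> str:
    """Suggest an engine using the rule of thumb from this thread.

    All names and the hard 1 TB cut-off are illustrative assumptions.
    """
    if needs_hive_only_features:
        # SparkSQL is functionally more limited than Hive, so it is ruled out.
        return "hive-llap" if llap_available else "hive-on-tez"
    if llap_available:
        # Simplification: the answer prefers Hive on LLAP for most queries.
        return "hive-llap"
    if data_bytes < 1 * TB:
        # Below roughly 1 TB, SparkSQL was observed to outperform Hive on Tez.
        return "sparksql"
    # Large scans (the answer's example is 40 TB) favor Hive on Tez.
    return "hive-on-tez"


print(suggest_engine(40 * TB))
print(suggest_engine(TB // 2))
```

Real queries should still be measured on your own cluster; the sketch ignores the cost of maintaining both engines, which the answer treats as a reason to stay with Hive.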

If this response or any response in this thread was helpful, please don't forget to vote/accept it as the best answer.

3 REPLIES

Expert Contributor

Thanks for your valuable information. So your recommendation is to go for Hive on LLAP rather than SparkSQL. Please correct me if I am wrong.

Expert Contributor

Also, what is the need to run Hive queries on SparkSQL when Hive on Tez can run much faster?