Support Questions

Chandra · 09-02-2016

Hi,

Can you please let me know which one is faster: Hive on Tez, or accessing Hive through Spark SQL?

Thanks,

Chandra

cstanca · 09-02-2016

Just to clarify, SparkSQL does not access or use the Hive engine. It only consumes the metadata of the Hive data structures and runs the query on Spark itself.
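To illustrate that point, here is a minimal PySpark sketch, assuming Spark 2.x with Hive support built in and a cluster configured to reach the Hive metastore. The function name and any table name you pass in are hypothetical, used only for illustration.

```python
def count_rows_via_spark_sql(table_name):
    """Count the rows of a Hive-registered table using Spark's own engine."""
    # Imported lazily so the sketch can be loaded on a machine without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-metastore-sketch")   # hypothetical application name
        .enableHiveSupport()                # read table definitions from the Hive metastore
        .getOrCreate()
    )
    # The query is planned and executed by Spark, not by Hive on Tez.
    row = spark.sql("SELECT COUNT(*) AS n FROM {}".format(table_name)).collect()[0]
    return row["n"]
```

Calling it with a table that exists in your metastore would return the row count, with Spark doing all of the execution work.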

Assuming that both can execute the query (SparkSQL is quite limited functionally compared with Hive) and that the query needs to churn through 40 TB of data, I would say Hive on Tez is likely your best choice. That is also driven by cost: a Spark cluster needs RAM in addition to Hive's requirements, and I assume you will still have some cases where running Hive is needed. I have noticed that when the amount of data is less than 1 TB, SparkSQL outperforms Hive on Tez.

Anyhow, be aware that with HDP 2.5, LLAP is in Tech Preview and will soon be GA. If you were asking about Hive on LLAP vs. SparkSQL, I would say without hesitation: Hive on LLAP for most queries. Again, for some sophisticated queries over a limited amount of data and with limited functional requirements, SparkSQL may be the winner, but in the big picture it is too expensive to maintain both approaches, and I would still choose Hive on Tez and LLAP over SparkSQL for most cases that deal with big data. Besides, 1 TB does not need Hadoop for fast queries.
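The rule of thumb above can be written down as a small sketch. This is only a restatement of the guidance in this answer, not a benchmark: the function and parameter names are hypothetical, and treating 1 TB as a hard cut-off is a simplifying assumption.

```python
def suggest_engine(data_tb, llap_available=False, hive_only_features=False):
    """Suggest an engine following the rule of thumb in this answer."""
    # If the query needs functionality SparkSQL lacks, Hive is the only option.
    if hive_only_features:
        return "Hive on LLAP" if llap_available else "Hive on Tez"
    # With LLAP available, it is preferred for most queries.
    if llap_available:
        return "Hive on LLAP"
    # Assumption: 1 TB used as a hard threshold for "small" data.
    if data_tb < 1:
        return "SparkSQL"
    return "Hive on Tez"
```

For example, `suggest_engine(40)` follows the 40 TB case discussed above, while `suggest_engine(0.5)` follows the sub-1 TB case.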

Read more about Hive on LLAP here:

http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/

Give LLAP a shot before deciding to use SparkSQL, especially if you already have the queries written in HiveQL.

If this response or any response in this thread was helpful, please don't forget to vote/accept it as the best answer.

Cloudera Community

Hive on Tez or Hive query using Spark SQL