Improve Hive on Tez performance [HDP 2.6.4]

New Contributor

Hi,

 

I was wondering how to tune or set up Tez so that Hive gets close to the performance I see with Spark / Spark SQL.

 

Currently, I have a table that I need to scan to retrieve all rows matching a value in a certain column. The table is partitioned daily, with ~100,000,000 rows per day. In Spark SQL, a simple spark.sql("select * from table where col=12345 limit 10000").show(false) finishes in 5-10 minutes, while the same query in Hive (Hive on Tez) runs for over 20-30 minutes, at which point I cancel it. It is also worth noting that the Hive query occupies pretty much 100% of the cluster, while Spark SQL only goes up to 50%.

 

The cluster is currently running with ~4 TB on YARN. I can provide more details; I just don't know exactly what to share at the moment.

 

BR,

Dan

1 REPLY

Cloudera Employee

Hello @dandaran 

 

There is a great community post on this topic: Demystifying Tez Memory Tuning
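
As a starting point, here is a rough sketch of session-level settings that are commonly adjusted when tuning Hive on Tez, followed by a version of the query restricted to a single day. All numeric values are placeholders to adapt to your YARN container sizes, not recommendations. The table name `my_table`, the partition column `dt`, and the date literal are hypothetical, since the actual schema was not shared.

```sql
-- Placeholder values: size these against your YARN min/max container allocation.
SET hive.tez.container.size=4096;            -- Tez task container size, in MB
SET hive.tez.java.opts=-Xmx3276m;            -- JVM heap, kept below the container size
SET hive.vectorized.execution.enabled=true;  -- vectorized execution (benefits ORC tables most)

-- Hypothetical names: my_table, dt. Filtering on the partition column lets
-- Hive prune partitions instead of scanning every day of data.
SELECT *
FROM my_table
WHERE dt = '2018-01-01'
  AND col = 12345
LIMIT 10000;
```

Running EXPLAIN on the query before and after adding the partition filter is a quick way to check whether the set of scanned partitions actually shrinks.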