Improve Hive on Tez performance [HDP 2.6.4]

New Contributor

Hi,

 

I was wondering how to set up and tune Tez so that Hive gets close to the performance I see with Spark / Spark SQL.

Currently, I have a table that I need to scan to retrieve all rows matching a certain column value. The table is partitioned daily, with ~100,000,000 rows per day. In Spark SQL, a simple spark.sql("select * from table where col=12345 limit 10000").show(false) finishes in 5-10 minutes, while the Hive SQL query (Hive on Tez) runs for 20-30 minutes, at which point I cancel it. It is also worth noting that the Hive query occupies pretty much 100% of the cluster, while Spark SQL only goes up to 50%.

The cluster currently has ~4 TB on YARN. I can provide more details; I just don't know exactly what to share at the moment.

 

BR,

Dan

1 REPLY

Cloudera Employee

Hello @dandaran

There is a great community post on this topic: Demystifying Tez Memory Tuning
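
As a rough illustration of the kind of settings that Tez memory tuning usually revolves around, here is a minimal sketch of session-level overrides in Hive. The property names are standard Hive/Tez settings, but every value is an assumption: they use a hypothetical 4 GB container and commonly cited rule-of-thumb ratios, and they are not measured or recommended for this particular cluster. Treat them as a starting point for experiments, not as a fix.

```sql
-- Assumed example values for a hypothetical 4 GB Tez container; adjust to your YARN limits.
SET hive.execution.engine=tez;

-- Size of each Tez task container, in MB (assumed value).
SET hive.tez.container.size=4096;

-- JVM heap for the task, here roughly 80% of the container (rule of thumb).
SET hive.tez.java.opts=-Xmx3276m;

-- Sort buffer for shuffle output, here roughly 40% of the container (rule of thumb).
SET tez.runtime.io.sort.mb=1638;

-- Buffer for unordered output, here roughly 10% of the container (rule of thumb).
SET tez.runtime.unordered.output.buffer.size-mb=409;

-- Map-join threshold in bytes, here roughly one third of the container (rule of thumb).
SET hive.auto.convert.join.noconditionaltask.size=1431655765;
```

Changing one setting at a time and comparing the query runtime and the Tez task counters after each change makes it easier to see which setting actually matters for your workload.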
