Hive Performance Tuning Parameters

Explorer

Hi all,

 

I am new to Hive and was told that the parameters below can improve Hive performance. If I set them in the sequence shown below, does the order matter, and are the settings related to each other? Do I need to include these statements every time I run a query, or is executing them once sufficient? Thanks.

 

 

set hive.exec.parallel=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.auto.convert.join=false;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=10000;

 

 

 

1 ACCEPTED SOLUTION

Cloudera Employee

Hi,

Here are a few points in answer to your questions:

  1. The order of the parameters does not matter.
  2. If you do not persist the settings in the configuration, you have to apply them at the start of each session.
  3. These parameters are not a holy grail: they do not guarantee better performance.
  4. Vectorized execution can cause errors and wrong results under specific circumstances. Enable it only if you need it and know that it works with the UDFs you use.
  5. Using the CBO and fetching stats can improve performance. Under the wrong circumstances, however, it can lead to a long stats-gathering period at the end of a query, which may cost more than the performance gained.
  6. Auto convert join should only be used if you know the sizes of your tables. Setting this property to true triggers a map join only if one table fits in memory; otherwise setting it has no effect and you will not see any improvement in runtime.
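
To make points 2, 5 and 6 a bit more concrete, here is a minimal sketch of what this could look like in a session. The table names `sales` and `customers` and their columns are hypothetical, and the threshold value is only illustrative, so treat this as a starting point rather than a recommended configuration.

```sql
-- Point 2: a "set" statement applies to the current session only.
set hive.exec.parallel=true;

-- Running "set" with just the property name prints its current value,
-- which is a quick way to confirm what the session is actually using.
set hive.exec.parallel;

-- Point 5: gather table and column statistics explicitly, so the CBO
-- has something to work with ("sales" is a hypothetical, unpartitioned table).
analyze table sales compute statistics;
analyze table sales compute statistics for columns;

-- Point 6: allow map joins, with a size threshold (in bytes) for the small side.
-- The value here is illustrative; pick one that matches your memory settings.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Check the plan: if the conversion happened, it contains a Map Join Operator.
explain
select s.id, c.name
from sales s
join customers c on s.customer_id = c.id;
```

If you do not want to retype the settings in every session, you can put the `set` lines in a file and pass it as an init script (for example with `beeline -i <file>`), or ask your administrator to persist them in the Hive configuration.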

 

I can really recommend this article by a fellow Clouderan:
https://community.cloudera.com/t5/Community-Articles/Hive-on-Tez-Performance-Tuning-Determining-Redu...

 

If you have concrete questions about optimizing a specific query, do not hesitate to ask.
