Support Questions

newbieone · 08-17-2022

Hi all,

I am new to Hive and was told that the parameters below are used to improve Hive performance. If I set them in the sequence shown, does the order matter, and do the settings depend on each other? Do I need to include them every time I run a query, or is executing them once sufficient? Thanks.

set hive.exec.parallel=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.auto.convert.join=false;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=10000;

Bilbao · 08-22-2022

Hi,

A few points on your questions:

The order of the parameters does not matter.
If you do not persist the settings in the configuration, you have to apply them at the start of each session.
These parameters are not the holy grail.
Vectorized execution can lead to errors and wrong results under specific circumstances. Use it only if it is required and known to work with the UDFs you use.
Using CBO and fetching stats can improve your performance. Under the wrong circumstances, though, it can lead to a long stats-gathering period at the end of a query, which may cost more than the performance gain.
Auto convert join should only be used if you know the sizes of the tables. Setting this property to true triggers a map join only if one of the tables fits in memory; otherwise the setting has no effect and you will not see any improvement in your runtime.
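
To make the points above a bit more concrete, here is a minimal sketch of what the start of a session could look like. The table names (orders, customers) and the join column are made up for illustration, and the threshold value is only an example, not a recommendation. Which size property actually governs the conversion can depend on your Hive version and on other join settings, so treat that line as an assumption to verify on your cluster.

```sql
-- Print the current value of a property before overriding it.
set hive.auto.convert.join;

-- Session-level override: it applies only until this session ends,
-- so it has to be repeated in every new session unless it is persisted.
set hive.auto.convert.join=true;

-- Example size threshold in bytes below which a table counts as "small"
-- for map join conversion (assumed value, check your own defaults).
set hive.mapjoin.smalltable.filesize=25000000;

-- CBO relies on statistics; gather them for a hypothetical table.
analyze table customers compute statistics for columns;

-- Inspect the plan to see whether a map join was actually chosen.
-- orders and customers are hypothetical tables.
explain
select o.order_id, c.customer_name
from orders o
join customers c on o.customer_id = c.customer_id;
```

If retyping the statements in every session becomes tedious, one option is to save them in a file and pass it as an initialization script when starting the client, for example with the -i option of Beeline or the Hive CLI.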

I can really recommend this article by a fellow Clouderan:
https://community.cloudera.com/t5/Community-Articles/Hive-on-Tez-Performance-Tuning-Determining-Redu...

If you have concrete questions about optimizing a specific query, do not hesitate to ask.

