- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How to increase performance of Tez in hive
- Labels:
-
Apache Hive
-
Apache Tez
-
Apache YARN
Created 07-19-2017 09:46 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I cluster 2 nodes with each node 16GB Ram and 2-CPU Cores and yarn memory has been configured 8GB as recommended. I have set those configuration in tez as below mentioned link;
I don't know how to configure as my hardware configuration. I'm over process data around 6million record. It takes much time. Please tell me how do i configure to execute as faster/increase performance.
Thanks in advance!
Regards,
Varun
Created 07-19-2017 05:26 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created 07-19-2017 06:09 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Below are few parameters which one would use in general. Ofcourse based on the logics other configuration settings cab also be modified.
Enable Compression in Hive
hive.exec.compress.output TRUE
hive.exec.compress.intermediate TRUE
hive.auto.convert.join
hive.auto.convert.join.noconditionaltask
hive.optimize.bucketmapjoin
avoid order by and try to use sort by
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
hive.compute.query.using.stats
hive.stats.fetch.partition.stats
hive.stats.fetch.column.stats
hive.stats.autogather
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept;
Created 07-19-2017 06:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Take a look at the below articles that cover Tez performance tuning as well as an overview of how it works.
https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html
Created 07-21-2017 08:44 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Varun R
Optimization varies in every case. Depends on incoming data, file size.
In general please use these setting for fine tuning
- Enable predicate pushdown (PPD) to filter at the storage layer:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true
- Vectorized query execution processes data in batches of 1024 rows instead of one by one:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
- Enable the Cost Based Optimizer (COB) for efficient query execution based on cost and fetch table statistics:
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Partition and column statistics from fetched from the metastsore. Use this with caution. If you have too many partitions and/or columns, this could degrade performance.
- Control reducer output:
SET hive.tez.auto.reducer.parallelism=true;
Partition table based on necessary column, also bucket the tables(wisely identify the column)
Also depends on how you want to tune your Query, based on Explain Plan. Please check number of Mappers and Reducers spawnned.
Created 07-21-2017 09:19 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Varun
Please see the below to control number of Reducers
setting MAPRED.REDUCE.TASKS = -1 -- this property lets Tez determine the no of reducers to initiate
hive.tez.auto.reducer.parallelism = true; --this property is enabled to TRUE, hive will estimate data sizes and set parallelism estimates. Tez will sample source vertices, output sizes and adjust the estimates at run time this is the 1st property that determines initial number of reducers once Tez starts the query
hive.tex.min.partition.factor =0.25;-- when auto parallelism enable, this property will be used to put a lower limit to number of reducers that Tez specified
1. hive.tez.max.partition.factor - 2.0; -- this property specifies,over-partition data in shuffle edges
2.hive.exec.reducers.max by default is 1099 --max number of reducers
3.hive.exec.reducers.bytes.per.reducer = 256 MB; which is 268435456 bytes
Now to calculate the number of reducers we will need to put altogether, along with this formula also from Explain plan
we will need to get the size of output, lets assume 200,000 bytes
Max(1, Min(hive.exec.reducers.max [1099], Reducer Stage estimate/hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor [2]
Max(1, Min(1099, 200000/268435456)) x 2
=MAX(1,min(1099,0.00074505805)) X 2
=MAX(1,0.0007) X 2 = 1 X 2 = 2 Tez will spawn 2 Reducers.
In this case we can legally make Tez initiate higher number of reducers by modifying value of hive.exec.reducers.bytes.per.reducer by setting it to 20 KB
=Max(1,min(1099,20000/10432)) X 2
=Max(1,19) X 2 = 38
Please note higher number of reducers doesn't mean better performance