Created on 03-11-2016 12:43 AM - edited 08-17-2019 01:03 PM
How does Tez determine the number of reducers? How can I control this for performance?
In this article, I will attempt to answer this while executing and tuning an actual query to illustrate the concepts. Then I will provide a summary with a full explanation. If you wish, you can skip ahead to the summary.
We set up our environment, turning CBO and vectorization on.
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
We created ORC tables and ran an INSERT OVERWRITE into a partitioned table.
# With many partition columns, there is a danger of generating many broken files in ORC. To prevent that:
> set hive.optimize.sort.dynamic.partition=true;
# If Hive jobs previously ran much faster than in the current release, look into potentially setting the property hive.optimize.sort.dynamic.partition = false.
> insert overwrite table benchmark_rawlogs_orc partition (p_silo,p_day,p_clienthash)
select * FROM <original table>;
We generated the statistics needed for query execution.
-- // generate statistics for the ORC table
-- // To Generate Statistics for Entire Table and Columns for All Days (Longer)
ANALYZE TABLE rawlogs.benchmark_rawlogs_orc partition (p_silo, p_day, p_clienthash) COMPUTE STATISTICS;
ANALYZE TABLE rawlogs.benchmark_rawlogs_orc partition (p_silo, p_day, p_clienthash) COMPUTE STATISTICS for columns;
Let's look at the relevant portions of this explain plan. We see (in red) that in the Reducers stage, 14.5 TB of data across 13 million rows is processed. This is a lot of data to funnel through just two reducers.
The final output of the reducers is just 190944 bytes (in yellow), after the initial group-bys of count, min and max.
We need to increase the number of reducers.
3. Set Tez Performance Tuning Parameters
When Tez executes a query, it initially determines the number of reducers it needs and automatically adjusts as needed based on the number of bytes processed.
- Manually Set Number of Reducers (not recommended)
To manually set the number of reducers, we can use the parameter mapred.reduce.tasks.
By default it is set to -1, which lets Tez automatically determine the number of reducers.
However, you can manually set it to a fixed number of reducer tasks (not recommended):
> set mapred.reduce.tasks = 38;
It is better to let Tez determine this and make the proper changes within its framework, instead of using the brute-force method.
> set mapred.reduce.tasks = -1;
- How to Properly Set Number of Reducers
First, we double-check whether auto reducer parallelism is on. The parameter is hive.tez.auto.reducer.parallelism.
# Turn on Tez's auto reducer parallelism feature. When enabled, Hive will still estimate data sizes and set parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary.
> set hive.tez.auto.reducer.parallelism;
> set hive.tez.auto.reducer.parallelism = true;
This is the first property that determines the initial number of reducers once Tez starts the query.
Then, there are two boundary parameters: hive.tez.min.partition.factor and hive.tez.max.partition.factor.
# hive.tez.min.partition.factor: when auto reducer parallelism is enabled, this factor is used to put a lower limit on the number of reducers that Tez specifies.
# hive.tez.max.partition.factor: when auto reducer parallelism is enabled, this factor is used to over-partition data in shuffle edges.
More on this parameter later.
The third property is hive.exec.reducers.max, which determines the maximum number of reducers. This article uses 1099; note that stock Apache Hive 0.14 and later ships with a default of 1009.
The final parameter that determines the initial number of reducers is hive.exec.reducers.bytes.per.reducer.
By default hive.exec.reducers.bytes.per.reducer is about 256 MB; the value used here is 258998272 bytes.
So, to put it all together, Hive/Tez estimates the number of reducers using the following formula and then schedules the Tez DAG:
Max(1, Min(hive.exec.reducers.max, ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor
So in our example since the RS output is 190944 bytes, the number of reducers will be:
> Max(1, Min(1099, 190944/258998272)) x 2
> Max (1, Min(1099, 0.00073)) x 2 = 1 x 2 = 2
Hence the 2 Reducers we initially observe.
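The arithmetic above can be reproduced with a small sketch. This is only a transcription of the formula as written in this article, not Hive's actual planner code. The function name `estimate_reducers` and its argument names are hypothetical, the defaults are the values quoted in this article, and rounding the size-based estimate up is an assumption.

```python
import math

def estimate_reducers(stage_bytes, bytes_per_reducer=258998272,
                      max_reducers=1099, max_partition_factor=2):
    """Initial reducer estimate, following the formula in this article.

    Hypothetical helper: the defaults are the values quoted in the text,
    and rounding the size-based estimate up is an assumption.
    """
    # ReducerStage estimate / hive.exec.reducers.bytes.per.reducer, rounded up
    by_size = math.ceil(stage_bytes / bytes_per_reducer)
    # Clamp between 1 and hive.exec.reducers.max
    clamped = max(1, min(max_reducers, by_size))
    # Over-partition by hive.tez.max.partition.factor
    return int(clamped * max_partition_factor)

print(estimate_reducers(190944))  # 2
```

Because the partition factor is applied after the clamp, the result under this formula can exceed hive.exec.reducers.max; that follows the formula as stated here.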
4. Increasing Number of Reducers, the Proper Way
Let's set hive.exec.reducers.bytes.per.reducer to about 10 KB, specifically 10432 bytes.
The new reducer count is
> Max(1, Min(1099, 190944/10432)) x 2
> Max (1, Min(1099, 18.3)) x 2 = 19 (rounded up) x 2 = 38
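Going the other way, one can ask which bytes-per-reducer values would produce a desired reducer count under the same formula. The sketch below is illustrative only: `bytes_per_reducer_range` is a hypothetical helper, it assumes an integer partition factor, rounds the size-based estimate up, and ignores the hive.exec.reducers.max cap.

```python
def bytes_per_reducer_range(target_reducers, stage_bytes, max_partition_factor=2):
    """Inclusive (low, high) range of bytes-per-reducer values that yield
    target_reducers under this article's formula.

    Hypothetical helper: assumes an integer partition factor, rounding up,
    and no hive.exec.reducers.max cap.
    """
    if target_reducers % max_partition_factor != 0:
        raise ValueError("target must be a multiple of the partition factor")
    n = target_reducers // max_partition_factor  # reducer count before the factor
    # Smallest setting for which ceil(stage_bytes / setting) == n
    low = (stage_bytes + n - 1) // n
    if n == 1:
        return low, None  # any larger setting still gives a single reducer
    # Largest setting strictly below stage_bytes / (n - 1)
    high = (stage_bytes + n - 2) // (n - 1) - 1
    return low, high

print(bytes_per_reducer_range(38, 190944))  # (10050, 10607)
```

The value 10432 used above falls inside this range, which is consistent with the 38 reducers computed by hand.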
With Tez's shuffle vertex manager source fractions (tez.shuffle-vertex-manager.min-src-fraction and tez.shuffle-vertex-manager.max-src-fraction) at 0.25 and 0.75, the decision will be made between 25% of mappers finishing and 75% of mappers finishing, provided at least 1 GB of data has been output (i.e., if 25% of mappers don't send 1 GB of data, Tez will wait until at least 1 GB is sent).
Once a decision has been made, it cannot be changed, as some reducers will already be running and might lose state if it were. You can get more accurate predictions by increasing the fractions.
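The decision window described above can be modeled roughly as follows. This is a simplified sketch of the behavior as described in this article, not Tez's implementation: the function `should_decide` and its parameters are hypothetical, and forcing the decision once the upper fraction is reached is my reading of the text.

```python
GIB = 1 << 30  # 1 GB threshold from the text, taken here as 2**30 bytes

def should_decide(finished_fraction, bytes_seen,
                  min_fraction=0.25, max_fraction=0.75, min_bytes=GIB):
    """Return True when the reducer count would be fixed.

    Hypothetical model: decide once the upper fraction of mappers has
    finished (assumption), or earlier if the lower fraction has finished
    and enough output has been observed.
    """
    if finished_fraction >= max_fraction:
        return True
    return finished_fraction >= min_fraction and bytes_seen >= min_bytes

print(should_decide(0.10, 2 * GIB))  # False: too few mappers have finished
print(should_decide(0.30, 100))      # False: still waiting for 1 GB of output
print(should_decide(0.30, GIB))      # True
```

Raising the fractions in this model delays the decision, so it is based on more observed output, which matches the trade-off described above.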
Hive-2.0 (only) improvements
We now have a total number of reducers, but you might not have the capacity to run all of them at the same time, so you need to pick a few to run first. Ideally, you start the reducers that already have the most data to fetch, so that they can begin doing useful work, instead of starting with reducer #0 (as MRv2 does), which may have very little data pending.
The first flag is pretty safe, but the second one is a bit more dangerous, as it allows reducers to fetch from tasks that haven't even finished (i.e., a mapper failure causes a reducer failure; this is optimistically fast, but slower when there are failures, which is bad for consistent SLAs).
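The idea of starting the reducers with the most pending data first can be sketched in a few lines. This only illustrates the ordering idea and is not how Tez schedules tasks internally; `pick_first_reducers`, `pending_bytes`, and `slots` are hypothetical names.

```python
def pick_first_reducers(pending_bytes, slots):
    """Pick which reducers to start first when only `slots` can run at once.

    Hypothetical illustration: pending_bytes[i] is the amount of map output
    already available for reducer i. Reducers with the most pending data go
    first, instead of simply starting from reducer #0.
    """
    ranked = sorted(range(len(pending_bytes)),
                    key=lambda r: pending_bytes[r],
                    reverse=True)
    return ranked[:slots]

# Reducer #0 has very little data pending, so it is not started first.
print(pick_first_reducers([10, 5000, 200, 4000], slots=2))  # [1, 3]
```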
Finally, we have the sort buffers, which are usually tweaked and tuned to fit, but you can make things much faster by making those allocations lazy (i.e., allocating 1800 MB contiguously in a 4 GB container will cause a 500-700 ms GC pause, even if there are only 100 rows to be processed).