Basically, the reducers start once 85% of the mappers have completed. Is there any option to keep the reducers from starting until all the mappers have completed?
Do we have an option to set that parameter in Tez as well?
Thanks for your response. So if I reduce the value, the reducers will not start until all the mappers have completed, right? Any idea what value would hold the reducers back until all the mappers finish?
The setting is the fraction of mappers that have to finish before a reducer is started, so mapred.reduce.slowstart.completed.maps=1.0 will wait until all maps are finished.
You need to set mapred.reduce.slowstart.completed.maps in mapred-site.xml (the value is a fraction between 0 and 1).
If you need reducers to start only after all map tasks have completed, set mapred.reduce.slowstart.completed.maps=1.0.
An ideal setting would be mapred.reduce.slowstart.completed.maps=0.8 (or 0.9), so that reducers start only after 80% (or 90%, respectively) of the map tasks have completed.
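To make the fraction concrete, here is a minimal sketch of the arithmetic. It assumes the fraction is rounded up to a whole number of map tasks; the function name is made up for illustration and is not part of Hadoop.

```python
import math


def maps_required_before_reducers(total_maps: int, slowstart: float) -> int:
    """Estimate how many map tasks must finish before reducers may start."""
    if total_maps < 0:
        raise ValueError("total_maps must not be negative")
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0.0 and 1.0")
    # Assumption: the fraction is rounded up to a whole task count.
    # round() guards against floating-point noise such as 160.00000000000003.
    return math.ceil(round(slowstart * total_maps, 9))


if __name__ == "__main__":
    for fraction in (0.8, 0.9, 1.0):
        needed = maps_required_before_reducers(200, fraction)
        print(f"slowstart={fraction}: {needed} of 200 maps must finish first")
```

With a value of 1.0 the result always equals the total number of map tasks, which is the "wait for every mapper" behaviour asked about above.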
In the latest version of Hadoop (HDP 2.4.1) the parameter name has changed to
You can also set this parameter on a per-job basis.
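One way to do a per-job override is Hadoop's generic `-D property=value` option, which only takes effect if the job's driver parses generic options (for example via ToolRunner). The sketch below just assembles such a command line; the jar name, class name and paths are hypothetical placeholders, and the helper itself is not part of any Hadoop tooling.

```python
import shlex


def per_job_command(jar, main_class, overrides, app_args):
    """Build a 'hadoop jar' command with -D overrides placed before the job arguments."""
    cmd = ["hadoop", "jar", jar, main_class]
    for key, value in overrides.items():
        # Generic options must come before the application's own arguments.
        cmd += ["-D", f"{key}={value}"]
    return cmd + list(app_args)


# Hypothetical jar, class and paths, used only for illustration.
command = per_job_command(
    "my-job.jar",
    "com.example.MyJob",
    {"mapred.reduce.slowstart.completed.maps": "1.0"},
    ["/input", "/output"],
)

if __name__ == "__main__":
    print(shlex.join(command))
```

Building the command as a list and joining it only for display avoids quoting problems if a value ever contains spaces.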
You are actually correct. Tez has two other parameters:
So I suppose if you set both to 1.0 it should have the same effect.
Now, you should be a bit careful with that. Hive on Tez does some magic, like keeping containers around in case they are needed later on (tez.am.container.idle.release-timeout-min.millis), so changing that parameter might just mean that some containers sit idle for a while.
Hi Benjamin, will making the reducers start only after the mappers are 100% complete give any performance improvement? Or is this a best practice? Can you please advise?
The reason for the early start is that reducers can begin copying data from already finished map tasks while the remaining map tasks finish. So the best practice is to have them ramp up gradually, as is the default. This will make the query finish faster. That is why these parameters exist.
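The overlap argument can be illustrated with a toy model. This is only a sketch under strong assumptions: a single reducer, one map output copied at a time, a fixed copy time per map output, and no contention for containers. None of the names below come from Hadoop or Tez.

```python
import math


def copy_phase_end(map_finish_times, copy_seconds, slowstart):
    """Return the time at which the reducer has copied every map output."""
    finished = sorted(map_finish_times)
    # The reducer starts once this many maps are done (at least one,
    # since there is nothing to copy before the first map finishes).
    needed = max(1, math.ceil(round(slowstart * len(finished), 9)))
    clock = finished[needed - 1]
    for available_at in finished:
        # A copy can begin only when the output exists and the reducer is free.
        clock = max(clock, available_at) + copy_seconds
    return clock


if __name__ == "__main__":
    finish_times = [10, 20, 30, 40]  # seconds at which each map task ends
    for fraction in (0.5, 1.0):
        end = copy_phase_end(finish_times, copy_seconds=5, slowstart=fraction)
        print(f"slowstart={fraction}: copy phase ends at t={end}s")
```

In this model an early start hides most of the copy time behind the still-running maps, while a value of 1.0 pushes all of the copying to after the last map has finished.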
The early start will not slow down the job itself, since map tasks are allocated first.
However, you might impact OTHER tasks because more tasks are running. So I have disabled it in situations of high concurrency where I wanted the highest possible throughput for all queries. However, it depends on your query: Tez will hang on to containers for 10 seconds anyhow, so as long as your mappers do not take too long you will not get much benefit. It might be different for very long-running mappers/reducers.
That is the reason I don't like "what is best practice" questions. The answer is always: it depends on your concrete situation and queries.