Support Questions

satyap · ‎02-21-2017

Hi,

I know the number of map tasks is basically determined by the number of input files and the number of input splits of those files.

So if we want to process 200 files of the same block size (or larger), we need 200 map tasks, and in the case of 1k files we need 1k map tasks.
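As a rough illustration of that arithmetic, here is a minimal sketch that counts one map task per input split. It assumes splittable, non-empty files and a split size equal to the block size, and it ignores details of the real input format such as the slack allowed on the last split. The function name `map_task_count` is made up for this example and is not part of Hadoop.

```python
def map_task_count(file_sizes_bytes, split_size_bytes):
    """Estimate map tasks as one per input split (hypothetical helper)."""
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    # Splits never span files, so apply ceiling division to each file:
    # a file larger than one split needs more than one map task.
    return sum(-(-size // split_size_bytes) for size in file_sizes_bytes)


MB = 1024 * 1024
print(map_task_count([128 * MB] * 200, 128 * MB))  # 200 block-sized files
print(map_task_count([300 * MB], 128 * MB))        # one file spanning several splits
```

Under these assumptions, files that are larger than one block push the map count above the file count, while many small files still cost one map task each.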

How do we set the number of reducers for these files? Apart from setNumReduceTasks() or the mapreduce.job.reduces configuration, is there any algorithm or logic, like a hash key, to derive the number of reducers?

Secondly, I want to know how the number of containers and the required resources are requested by the AM from the ResourceManager.

Suppose 2 GB of RAM is available on a NodeManager and we submit a job that needs 3 GB of RAM. How will the job run, or will it not run at all?

If you can give the exact flow, with the logic, from map task to reduce task and through to container assignment, it would be really helpful for me.

mqureshi · ‎02-21-2017

@satya gaurav

The number of reducers is determined by mapreduce.job.reduces. It is typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave.
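A minimal sketch of that rule of thumb, assuming the cluster's reduce capacity can be approximated as nodes times reduce containers per node. Both inputs and the helper name `suggested_reducers` are hypothetical; Hadoop does not compute this value for you, and you would still pass the result to mapreduce.job.reduces yourself.

```python
def suggested_reducers(nodes, reduce_containers_per_node, percent=99):
    """Return roughly `percent`% of the cluster's reduce capacity (hypothetical helper)."""
    capacity = nodes * reduce_containers_per_node
    # Integer arithmetic avoids floating-point rounding; never suggest fewer than 1.
    return max(1, capacity * percent // 100)


# Example: 200 nodes, each able to run 4 reduce containers at once.
reducers = suggested_reducers(200, 4)
remaining_capacity = (200 - 1) * 4  # capacity left after losing one node
print(reducers, remaining_capacity)
```

In this example the suggested count still fits into the capacity that remains after one node fails, which is the point of leaving the 1% headroom. On a small cluster a single node is a much larger share of capacity, so the same margin may not be enough.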

As for the ApplicationMaster, you first need to understand the YARN components. The ResourceManager has two main components: the Scheduler and the ApplicationsManager (NOT the ApplicationMaster).

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

Question: Suppose 2 GB of RAM is available on a NodeManager and we submit a job that needs 3 GB of RAM. How will the job run, or will it not run at all?

You need to understand how the capacity scheduler works. Each application (job) is assigned to a queue, and the capacity scheduler guarantees a certain amount of resources to that queue. If resources are free in other queues, your job can borrow them. This means that if the queue is configured for 2 GB but your job needs more, it can borrow the difference. Please see the section titled "YARN Walkthrough" on the following page:

http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/

mqureshi · ‎02-21-2017

@satya gaurav

Check the following link. I don't want to repost the items as is, but the explanation here is what you are looking for:

http://www.bigdatanews.com/profiles/blogs/hadoop-yarn-explanation-and-container-memory-allocations

satyap · ‎02-21-2017

@mqureshi,

Thanks a lot for your explanation!

I am a little bit confused about the logic behind the number of map tasks and reduce tasks, and about resource management in Hadoop.

As you have written, the number of reducers can be determined by mapreduce.job.reduces. But if we specify a larger number of reducers, the job will still run: the ResourceManager will check resource availability, and if the resources are available the job will run with the requested number of reducers. Am I correct? So this configuration parameter is just a recommendation for YARN, and in the end the ResourceManager makes the decision based on the available resources.

What worries me most is how programmers decide how many reducers they need to process a file.

Do they have to calculate it every time before job submission?

mqureshi · ‎02-21-2017

@satya gaurav

The number of reducers is determined exactly by mapreduce.job.reduces. This is not just a recommendation. If you specify a high number of reducers, container allocation is still bounded by the queue capacity available to that application, which is determined by your scheduler. Just because you request more than you should doesn't mean that those resources will be allocated: the extra reducers wait in the queue until others complete. To get more details, you need to understand schedulers (the capacity scheduler, to be precise).
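To make the waiting behaviour concrete, here is a small sketch that estimates how many waves the reducers run in. It assumes every reducer asks for a container of the same size and that the queue can run a fixed number of them at once. The function name `reducer_waves` and the example numbers are illustrative, not anything YARN reports.

```python
def reducer_waves(num_reducers, concurrent_containers):
    """Estimate how many rounds of reducers run when only some fit at once."""
    if concurrent_containers <= 0:
        raise ValueError("the queue must be able to run at least one container")
    # Ceiling division: a partially filled last wave still counts as a wave.
    return -(-num_reducers // concurrent_containers)


# Example: mapreduce.job.reduces = 100, but the queue only fits 40 at a time.
print(reducer_waves(100, 40))
```

All 100 reducers still run in this example. Asking for more than the queue can hold stretches the job over more waves rather than changing the reducer count.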

The minimum container size is given by yarn.scheduler.minimum-allocation-mb (a request for less than this value still results in a container of this minimum size, not the smaller size you asked for). Similarly, there is an upper limit given by yarn.scheduler.maximum-allocation-mb. Guess what happens if you request more than this? You don't get it. Depending on the Hadoop release, the request is either capped at this value or rejected with an InvalidResourceRequestException. There are similar settings for cores. These limits apply at the cluster level.
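The limits above can be sketched as a small function. This is an illustration under stated assumptions, not YARN's actual code: the name `normalize_request` and the `reject_over_max` flag are made up for this example, and rounding up to a multiple of the minimum allocation is an assumption about how the scheduler is configured. The capping branch follows the description above, and the flag models releases that reject an over-limit request instead.

```python
def normalize_request(requested_mb, min_mb, max_mb, reject_over_max=False):
    """Return the container size a memory request would end up with (sketch)."""
    if requested_mb > max_mb:
        if reject_over_max:
            # Models releases that refuse the request outright.
            raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
        return max_mb  # Models the capping behaviour described in the post.
    # Assumption: sizes are rounded up to a multiple of the minimum allocation.
    multiples = -(-requested_mb // min_mb)
    return min(max_mb, max(min_mb, multiples * min_mb))


print(normalize_request(512, 1024, 8192))   # below the minimum
print(normalize_request(1500, 1024, 8192))  # between the limits
print(normalize_request(9000, 1024, 8192))  # above the maximum
```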

On each node, containers are launched by the NodeManager, which relies on the ResourceManager for the allocation decision. yarn.nodemanager.resource.memory-mb is the total amount of memory on that node that can be allocated to containers, and yarn.nodemanager.resource.cpu-vcores is the equivalent setting for CPU.
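A back-of-the-envelope sketch of what those two node-level settings imply, treating yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores as the node's total budget for containers. The helper `containers_per_node` is hypothetical, and it assumes the scheduler enforces both memory and vcores, which depends on the configured resource calculator.

```python
def containers_per_node(node_memory_mb, node_vcores, container_mb, container_vcores):
    """Count how many equally sized containers fit on one NodeManager (sketch)."""
    if container_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    by_memory = node_memory_mb // container_mb
    by_cpu = node_vcores // container_vcores
    # The scarcer resource decides how many containers fit.
    return min(by_memory, by_cpu)


# A node offering 2 GB cannot host a single 3 GB container.
print(containers_per_node(2048, 8, 3072, 1))
# A node offering 8 GB and 8 vcores fits four 2 GB, 1-vcore containers.
print(containers_per_node(8192, 8, 2048, 1))
```

The first call relates to the original question: a container cannot be split across nodes, so a 3 GB request has to be placed on some node with at least 3 GB free.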

Cloudera Community

How are the number of map tasks and the number of reduce tasks determined, and how does the ApplicationMaster (AM) determine how many containers are needed to run a particular job?