I have a newly set up small cluster (32 nodes, 128 cores, running in EC2) using CDH 5.2 with Cloudera Manager. I'm new to YARN/MR2 and am having some trouble figuring out how to configure the YARN resource queues. Right now, we're running with the defaults -- no min/max limits and each user getting their own sub-queue.
The defaults clearly aren't working. We frequently get a deadlock situation when two large jobs (~4000 mappers, 70-100 reducers) are running at the same time. The jobs run up to a point where they hang: the reducers get slow-started, and then YARN runs out of available containers to allocate to the pending incomplete maps. I can't find any way (unlike in MR1) to dedicate slots to mappers and reducers so that one can't starve out the other.
I was thinking of disabling undeclared pools, then creating two resource pools, a default pool with DRF scheduling dedicated to small, ad hoc jobs, and a separate queue for long-running tasks using FIFO scheduling. I have no idea how to set the min/max limits for these.
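Roughly what I had in mind for the allocation file (fair-scheduler.xml); the queue names and min/max numbers below are untested placeholders I made up, plus yarn.scheduler.fair.allow-undeclared-pools set to false in yarn-site.xml to stop the per-user pool creation:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- small ad hoc jobs land here; DRF balances memory and cores -->
  <queue name="default">
    <schedulingPolicy>drf</schedulingPolicy>
    <minResources>16384 mb, 8 vcores</minResources>
    <maxResources>65536 mb, 32 vcores</maxResources>
  </queue>
  <!-- long-running batch jobs, served one after another -->
  <queue name="batch">
    <schedulingPolicy>fifo</schedulingPolicy>
    <minResources>32768 mb, 16 vcores</minResources>
    <maxResources>163840 mb, 80 vcores</maxResources>
  </queue>
</allocations>
```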
Any thoughts on this plan? If anyone has better suggestions, or pointers to documentation that covers this, I'd really appreciate hearing it.
I think you are mixing two things there: contention between mappers and reducers, and contention between small and long-running jobs.
For the latter, it is a good idea to define separate queues (by default you only need separate users submitting the jobs, and separate queues will be created for them). Even if you don't put restrictions on them, the Fair Scheduler (which is the default in several distributions, including CDH) will probably do the trick and prevent the small ad hoc jobs from starving.
About the issue between mappers-reducers, I don't think queues can help you there.
What I would do first is try to reduce the number of mappers; I think it's too high. Bear in mind that each container allocation has overhead. For your cluster (128 cores), maybe 400 mappers and 20 reducers (depending on the number of keys you have) would do better and avoid the deadlock altogether.
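If your input is splittable, a sketch of how you could cap the mapper count from the job side (the property names are standard MR2; the values are just examples to tune for your data):

```xml
<!-- in the job configuration, or via -D on the command line -->
<!-- bigger splits mean fewer map tasks; ~1.25 GB per split here -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>1342177280</value>
</property>
<!-- pin the reducer count explicitly -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>20</value>
</property>
```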
If you cannot or don't want to change the numbers, then try changing the property mapreduce.reduce.slowstart.completed.maps.
By default it is very aggressive and starts running reducers (just so they start moving data) when only 5% of the mappers have finished.
Try increasing it but I still think reducing the number of mappers/reducers is a better solution.
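For reference, the knob lives in mapred-site.xml (or can be set per job); something like this would make reducers wait until 80% of the maps have finished:

```xml
<property>
  <name>mapreduce.reduce.slowstart.completed.maps</name>
  <!-- fraction of maps that must complete before reducers launch;
       1.0 disables slowstart entirely -->
  <value>0.80</value>
</property>
```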
Thanks for the suggestions. Decreasing mappers/reducers works, but it isn't an ideal solution, since that's up to the discretion of the developer submitting the job (this is a shared cluster). I can try to explain why one shouldn't use so many, but I'm concerned that relying on best practices to prevent deadlock is going to be a maintenance nightmare.
For now, I've just turned off slowstart entirely (set the fraction to 1.0, so reducers only launch once every map is done), which is also probably best for maximizing throughput on this small cluster. Individual jobs might take longer, but we haven't seen the deadlock problem arise since.
I have the same problem: 16 jobs waiting for a free slot. What is the right way to overcome this? I never had this problem on MR1.
The only thing I found that worked consistently was to kill any deadlocked jobs, then set the default slowstart fraction (mapreduce.reduce.slowstart.completed.maps) to 1 to prevent reducers from spinning up and grabbing containers before the mappers all finish.
Is there any possibility to run a kind of fair scheduler in YARN? Setting the slowstart fraction to 1 would bring serious performance problems.
Cloudera uses the Fair Scheduler by default, but it does not solve the deadlock.
Slowstart has a very modest impact. Bear in mind that early-launched reducers only get a head start on copying map output; the reduce phase itself cannot start until all mappers are done.
Thanks, I'm going to limit the number of concurrent jobs.
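In case it helps, the Fair Scheduler allocation file supports a per-queue cap on concurrent applications; a sketch (the queue name and numbers are placeholders):

```xml
<allocations>
  <queue name="batch">
    <!-- at most 2 applications from this queue run at once -->
    <maxRunningApps>2</maxRunningApps>
  </queue>
  <!-- optional default cap per user across the cluster -->
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
```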
Slowstart should help when you have significant intermediate output. I have Mahout jobs which transfer a lot from the map to the reduce side. I used the Fair Scheduler on CDH 4.4 and never hit any problem; it just worked. I need some time to get to grips with the YARN stuff. Thanks!
I forgot something. You also have to reduce yarn.scheduler.minimum-allocation-mb to 512 MB to save memory on the AM setting; otherwise it will allocate 1 GB anyway.
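That setting goes in yarn-site.xml:

```xml
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <!-- smallest container YARN will grant; requests below this
       are rounded up, so a 512 MB ask stays 512 MB instead of 1 GB -->
  <value>512</value>
</property>
```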