In a YARN operated hadoop cluster, what is the best way to get a real-time picture of resource allocations and then perform allocations dynamically? To my knowledge, YARN does only static resource allocations, in the sense that, once a job is given resources, the next job has to wait till the earlier job completes.
For example, assume two jobs are submitted to the cluster one after the one. Suppose first job (Pig) has a resource need of 4 containers, each with 2GB memory. The second job (Hive) has a requirement for 2 containers, each with 4GB memory. Also, assume, minimum container size on YARN has been set to 4GB.
In the above setup, once first job starts running, the second job cant be accommodated as YARN 'thinks' there are no containers to accommodate the job. However, if there is a way to adjust resources dynamically, such that the first job only consumes 2GB containers, then the second job as well can run alongside.
Just wondering how this is achieved in the present YARN setup. I understand that products like PepperData help to resolve such cases. However, what is the best way to achieve natively?
First we have to mention how jobs in hadoop normally work. Single tasks should not run for a very long time ( or you are doing something wrong like having a single reducer for a big job ) but you may have big jobs with thousands of tasks that can block a cluster for a long time.
To make sure a single task doesn't completely take over a cluster you have different options:
a) Create queues
If you have two queues you can for example set them both to 50% capacity and 100% max capacity. The first huge application would take the full cluster but as its containers finish the free slots would be assigned to the other job till both have 50% of the cluster. Good in most situations.Queues can be very fine tuned with sub queues etc.
b) Change the ordering policy in a queue.
Per default the ordering policy is FIFO i.e. job1 with 10000 tasks will block the full queue till all 10000 tasks finish then job2 starts. However you can change this ordering policy to Fair ( not to be confused with fair scheduler this is still capacity scheduler). If it is set to fair free task slots will be given to the other jobs waiting in the queue so all of them run. This is easier to setup than multiple queues but doesn't give you the ability to set ratios
The above works well normally when well designed jobs with tasks that take under a minute are running. However sometimes bad jobs come in that have endless running tasks or you are rtunning things like spark streaming that per definition run forever. In this case you can enable preemption. If a queue takes too much resources the scheduler will start shooting down tasks from it after a given period of time and give it to the other queues. Using queues and preemption you can guarantee fair distribution of resources but you might impact running jobs.
d) Long running applications
The above works well for normal HAdoop jobs. I.e. jobs that are made up of small tasks that can be re-run. However some applications require a specific amount of containers to run in the first place like Spark Streaming. If you shoot down one of those containers you need to wait till he can be rescheduled for the application in total to run. The problem is that yarn cannot really do anything here since it doesn't by definition know what the applications are doing with the containers. So reducing the memory needs of something like that would need to be supported by the application. I don't think Spark streaming for example supports it, you would need to stop and restart the app. Others might Slider is a framework to package long running applications and run them on yarn and you can normally "flex" up or down applications. I.e. reduce or increase their container needs. Slider will then stop/start containers in a way to increase/reduce the total memory needs.
Hope that helped.