If we have around 12 workflows running at the same time, 24 CPU's are consumed. That's all we have in our cluster. When 100% vCPU's are consumed. All processes get stuck.
Is there a statergy in hadoop we can use to delay the start of workflow/processes if the resources are limited rather than having all of them stuck for hours?
These are oozie workflows I am talking about here that have hive and sqoop jobs in them
24 vCPUs is really very little, so you should definitely stagger them by timing them appropriately. You can schedule them with an oozie coordinator over the course of the day,
It might be worth reviewing your container sizes to make sure you don't have very large containers. If you don't use SmartSense (which will calculate these for you automatically), you can manually calculate using this doc.
Also, have you tried setting up YARN capacity scheduler queues in order to separate and prioritise different jobs? See the YARN Resource Management guide for more info on that.
I have an m4*2 large , c4*4 land m4*4. That gives me 24 vCPU's on AWS servers. I hope these are the vCpu's we are talking about here? I thought it should be enough. Stagger them out ofcourse yes. was hoping to be able to deal with process halting more effectively. I am currently using fair scheduling and that is pre-emptive enabled.
Depending on the scale of the jobs, it might be enough or it might not. It's hard to tell without knowing what you're doing.
If you're using Fair Scheduler as opposed to the Capacity Scheduler, that might be your problem - see this bug YARN-5774. Check yarn.scheduler.minimum-allocation-mb and also make sure your container sizes are appropriate as discussed.