I have an Oozie flow that runs at high frequency (every 15 minutes, roughly 100 runs per day).
When an Oozie coordinator with a dataset is started, it looks for the folders in HDFS that correspond to each instance of the frequency.
If a folder is not there, the action goes into WAITING state and Oozie checks every minute whether the folder has been created, so that it can start the job.
However, since the check is expensive, it only schedules up to 8 such jobs (defined by the throttle parameter). So if it is already waiting for 8 folders, it will not create any more.
However, if Oozie was stopped for a couple of days while the folders kept being created in HDFS, it has a couple of hundred actions to schedule, and they all go into READY state. This is not heavy in itself, since no more checks are needed, but threads are apparently created for them and count against the process limit of the Oozie server.
Long story short, I reached the nproc ulimit of the Oozie server and it dies a ghastly death. We could try raising the ulimits, but that doesn't seem very clean, and the new limit could be reached as well.
Does anybody know how to throttle the number of READY jobs in Oozie as well?
Oozie throttles using "concurrency" and an execution order such as FIFO/LIFO. The execution policies for the actions of a coordinator job can be defined in the coordinator application. This should be user-defined, depending on the frequency of a job. There should also be a system-wide, admin-defined cap to protect Oozie's resources from "greedy" users.
Timeout: A coordinator job can specify the timeout for its coordinator actions, that is, how long a coordinator action will be in WAITING or READY status before giving up on its execution.

Concurrency: A coordinator job can specify the concurrency for its coordinator actions, that is, how many coordinator actions are allowed to run concurrently (RUNNING status) before the coordinator engine starts throttling them.

Execution strategy: A coordinator job can specify the execution strategy of its coordinator actions when there is a backlog of coordinator actions in the coordinator engine. The different execution strategies are 'oldest first', 'newest first' and 'last one only'. A backlog normally happens because of delayed input data, concurrency control, or manual re-runs of coordinator jobs.

Throttle: A coordinator job can specify the materialization or creation throttle value for its coordinator actions, that is, the maximum number of coordinator actions allowed to be in WAITING state concurrently.
Have a look at this doc
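As a rough sketch of where these four settings live, they go in the controls element of the coordinator application. The app name, dates and all values below are made up for illustration and are not taken from the question; only the 15-minute frequency matches it.

```xml
<!-- Hypothetical coordinator; name, start/end and control values are illustrative only -->
<coordinator-app name="example-coord" frequency="${coord:minutes(15)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- Minutes an action may stay in WAITING or READY before it times out -->
    <timeout>60</timeout>
    <!-- Maximum number of actions in RUNNING status at the same time -->
    <concurrency>2</concurrency>
    <!-- Backlog order: FIFO (oldest first), LIFO (newest first) or LAST_ONLY -->
    <execution>FIFO</execution>
    <!-- Maximum number of actions in WAITING state at the same time -->
    <throttle>8</throttle>
  </controls>
  <!-- datasets, input-events and action omitted -->
</coordinator-app>
```

Note that concurrency limits RUNNING actions and throttle limits WAITING actions; none of these controls puts a cap on the number of READY actions.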
Apparently there is no way to throttle READY jobs; however, my thread problem seems to be unrelated and a bug. I will close this question.