Created on 03-01-2018 02:24 PM - edited 08-17-2019 06:25 PM
I have an Oozie workflow with multiple steps that is used for staging data in HDFS. The workflow is called multiple times -- once for each file I want to stage -- often in quick succession. When that happens, the designated YARN queue for the workflow reaches capacity and new instantiations of the workflow go to ACCEPTED but never reach RUNNING status. That makes sense, but once the queue is at capacity, the RUNNING jobs also stop making progress and are unable to move to the next step. It seems like YARN won't release each step's resources until the workflow moves to the next step, but there aren't enough resources available to allocate to that next step, resulting in a deadlock.
I've tried a number of different configurations. There are plenty of options for distributing the workload across multiple queues, but I haven't come across any settings that help me manage deadlocks within a single queue. What can I do, either from a YARN configuration standpoint or an application design standpoint, to avoid these sorts of deadlocks? I'd rather not modify the code that kicks off these processes to make it monitor cluster resources; is there some way I can just throw all of my executions onto the YARN queue and have them process successfully in a FIFO manner?
I've attached a snapshot of my current queue settings. One other point to note: in my Oozie workflows the entire job is allocated to the "staging" queue; it doesn't vary by action. Is that possibly a problem?
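For reference, the queue assignment is done once in the workflow's global configuration, roughly like this (a simplified sketch of my setup; "staging" is the real queue name, the rest is boilerplate):

    <global>
        <configuration>
            <property>
                <name>mapreduce.job.queue.name</name>
                <value>staging</value>
            </property>
        </configuration>
    </global>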
Created 03-03-2018 01:37 PM
Hi,
What are the RUNNING jobs that stop making progress? In particular are these jobs oozie-launcher jobs? You can check that by looking at the name of the jobs in the Scheduler view of the RM UI.
Oozie has a non-intuitive way of launching jobs due to legacy behavior from Hadoop 1 (note that this will be fixed in Oozie 5.x). In short, an Oozie action launches an oozie-launcher job (1 AM container + 1 mapper) that is responsible for actually launching the job you defined in your Oozie action. In the end, your Oozie action actually requires two jobs from YARN's point of view. When running multiple workflows at the same time, you can end up with a lot of oozie-launcher jobs filling up the queue capacity and preventing the actual jobs from being launched (they will remain in ACCEPTED state). A common practice is to have a dedicated queue for the oozie-launcher jobs created by Oozie workflows; this prevents that kind of deadlock situation. IIRC, you can set the queue for oozie-launcher jobs using oozie.launcher.mapred.job.queue.name.
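For example, something along these lines in each action's <configuration> block (the "launchers" queue name, the shell action, and the script/transition names are just placeholders for illustration; adapt them to your actual actions):

    <action name="stage-file">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- queue for the small oozie-launcher job -->
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launchers</value>
                </property>
                <!-- queue for the actual job the action runs -->
                <property>
                    <name>mapreduce.job.queue.name</name>
                    <value>staging</value>
                </property>
            </configuration>
            <exec>stage.sh</exec>
        </shell>
        <ok to="next-step"/>
        <error to="fail"/>
    </action>

With this split, the launchers only ever compete with each other for a small amount of capacity, and the "staging" queue is left free for the real work.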
Hope this helps.
Created 03-04-2018 01:49 AM
Thanks so much, Pierre. The jobs that stop making progress are the ones kicked off by the oozie-launcher jobs, which in turn causes the oozie-launcher jobs to hang as well. I had been wondering whether those oozie-launcher jobs were the issue; I'll try creating a separate queue for them (sketched below) and report back.
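In case it helps anyone else, the change I'm planning is along these lines in capacity-scheduler.xml (the queue names and percentages are just what I intend to try, not a recommendation; the other queues' capacities have to be adjusted so the children of root still sum to 100):

    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,staging,launchers</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.launchers.capacity</name>
        <value>10</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.launchers.maximum-capacity</name>
        <value>20</value>
    </property>

Then I'll point oozie.launcher.mapred.job.queue.name at "launchers" in the workflow actions, as described above.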
Created 03-04-2018 04:34 AM
This seems to have resolved the problem, thanks again!