Falcon retention deadlock

Contributor

Hi,

we are having serious problems with Falcon feed retention. The current retention implementation creates one Oozie coordinator job for each feed. This coordinator periodically (e.g. daily) starts an Oozie workflow job to take care of the feed's retention. This means that a large number of YARN jobs are scheduled within one short interval. The number of jobs exceeds the capacity of our queue, so the jobs get stuck and the queue quickly fills with pending jobs.

After some investigation we believe there is a deadlock in the retention processing, since both the Oozie launchers and the Oozie actions run in a single queue. This queue quickly fills with launchers, so no containers are left for the actions. The launchers in turn wait for their actions, so nothing actually gets done and the whole queue is frozen.

This is a known issue with Oozie that is normally solved by reserving a separate queue just for the launchers, so that they do not compete with actions, using the 'oozie.launcher.mapred.job.queue.name' property.
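For illustration, in a hand-written Oozie workflow this is usually done in the action's configuration block. The following is only a sketch: the action name, the main class, and the `${launcherQueueName}` parameter are hypothetical placeholders, and the property names are the ones mentioned in this post.

```xml
<!-- Sketch of an Oozie java action that separates launcher and action queues.
     "example-java-action", "org.example.Main" and ${launcherQueueName}
     are hypothetical placeholders, not taken from Falcon. -->
<action name="example-java-action">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Queue for the job started by the action itself -->
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
            <!-- Queue for the launcher job that starts the action -->
            <property>
                <name>oozie.launcher.mapred.job.queue.name</name>
                <value>${launcherQueueName}</value>
            </property>
        </configuration>
        <main-class>org.example.Main</main-class>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

With launchers confined to their own queue, they can no longer take up all the containers that the actions need.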

However, I could not find a way to tell Falcon to use this separate queue for the retention workflow.

Has anyone seen this? Any ideas?

Thanks,

Pavel

8 REPLIES

Re: Falcon retention deadlock

Rising Star

Hi @Pavel Benes, users can specify custom properties for feed/process jobs in the entity. In http://falcon.apache.org/EntitySpecification.html#... , look under the Custom Properties section.

In this scenario, please add the following property to the feed entity and the problem should be solved.

        <property name="queueName" value="name_of_queue"/>

Re: Falcon retention deadlock

Contributor

Hi @Balu,

thanks for your answer, but we are already using this property. The property controls which queue the retention jobs run in, but the problem is that both the Oozie launcher and the action jobs run in this single queue. IMO, to avoid the deadlock the launcher job needs to run in a different queue than the action job. The retention workflow consists of two actions, 'eviction' and 'succeeded-post-processing'. Typically the first one passes, but the second one hangs, since by the time it is ready to run the queue is completely filled with pending launchers for other jobs.


Re: Falcon retention deadlock

Rising Star

Hi @Pavel Benes, it is the same solution. Any time you want to set a property for a feed coordinator/workflow, just add it to the feed properties. So in this case, add the following. I just tested this in my local setup and it seems to work for me.

<property name="oozie.launcher.mapred.job.queue.name" value="name_of_launcher_queue" />
<property name="queueName" value="name_of_default_queue"/>

Re: Falcon retention deadlock

Rising Star

@Pavel Benes: I take that back. The retention coordinator and workflow configs had the property "oozie.launcher.mapred.job.queue.name" set correctly, but in my test setup the Oozie launcher still ran in the default queue. I am debugging this at the moment.


Re: Falcon retention deadlock

Contributor

@Balu: I have taken a deep dive into the Falcon source and found something interesting. This is a fragment of the Java Oozie action in the eviction-action.xml template:

<configuration>
   <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
   </property>
   <property>
      <name>oozie.launcher.mapred.job.priority</name>
      <value>${jobPriority}</value>
   </property>
   ...
</configuration>

As you can see, 'oozie.launcher.mapred.job.queue.name' is not here, although other 'oozie.launcher.*' properties are. I would expect it to be here if it were meant to be passed to the retention workflow.

Thanks.


Re: Falcon retention deadlock

Rising Star

@Pavel Benes - All properties that are added to the feed are made part of the job configuration that is passed on to the retention workflow. I tested this and can confirm it works. What I need to find out is why Oozie is using "mapred.job.queue.name" for the launcher when the "oozie.launcher.mapred.job.queue.name" property is available.


Re: Falcon retention deadlock

Rising Star

Hi @Pavel Benes, I confirmed that the following solution works. Please add the following properties to the feed entity definition.

<property name="oozie.launcher.mapreduce.job.queue.name" value="name_of_launcher_queue"/>
<property name="queueName" value="name_of_default_queue"/>

This will add the properties to the feed job configuration, and ensure that launcher jobs run in a different queue.
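As a rough sketch of where these lines go, custom properties sit inside the `<properties>` element of the feed entity. The feed name below is a hypothetical placeholder and the other feed elements are left out, so this is a fragment for orientation rather than a complete, valid feed definition.

```xml
<!-- Fragment only: "example-feed" is a hypothetical name, and the
     frequency, clusters, locations, ACL and schema elements that a
     real feed needs are omitted here. -->
<feed name="example-feed" xmlns="uri:falcon:feed:0.1">
    <properties>
        <!-- Queue for the retention action jobs -->
        <property name="queueName" value="name_of_default_queue"/>
        <!-- Separate queue for the Oozie launcher jobs -->
        <property name="oozie.launcher.mapreduce.job.queue.name" value="name_of_launcher_queue"/>
    </properties>
</feed>
```

Note that the launcher property here uses the `mapreduce` prefix, not the `mapred` prefix tried earlier in the thread.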


Re: Falcon retention deadlock

Mentor

@Pavel Benes, has this been resolved? Can you post your solution or accept the best answer?
