Falcon retention policy

Hello,

I have a question about how Falcon implements the retention policy for feed instances. I have observed that the retention action (DELETE in my case) is executed only within the dataset's validity interval. This means that several instances close to the end of the dataset's validity (how many depends on the dataset's frequency) are kept forever, even when a retention limit is defined.

Is this expected, or am I doing something wrong?

Thanks for any input,

Regards,

Pavel

1 ACCEPTED SOLUTION

Here is a sample feed XML definition.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(1)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2015-10-30T01:00Z" end="2015-10-30T10:00Z"/>
            <retention limit="hours(10)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/"/>
        <location type="meta" path="/"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <schema location="/none" provider="/none"/>
</feed>

In the above example, the validity time is "the time interval when the feed is valid on this cluster". After the validity time ends, Falcon is not expected to perform any operations on the feed. The retention job for this feed runs from the validity start time up to the validity end time and deletes any feed instances older than 10 hours. You are correct that some feed instances will never be deleted: once the validity interval ends, the retention job no longer runs, so instances that are still within the retention limit at that point are left in place. In the above example, feed instances between 2015-10-30T00:00Z and 2015-10-30T10:00Z will never be deleted.
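
The arithmetic behind that range can be sketched in a few lines of Python. This is only a model of the behavior described above, not Falcon's code: the function name `never_deleted` is made up for illustration, and it assumes that the last retention run fires at the validity end, that the validity end is exclusive, and that an instance exactly at the cutoff is kept.

```python
from datetime import datetime, timedelta, timezone


def never_deleted(validity_start, validity_end, frequency, retention_limit):
    """Hypothetical model: which instances are still present after the last retention run."""
    # Assumption: the final retention run fires at validity_end; nothing runs afterwards.
    cutoff = validity_end - retention_limit
    # Assumption: one instance per frequency step, validity end treated as exclusive.
    instances = []
    t = validity_start
    while t < validity_end:
        instances.append(t)
        t += frequency
    # Instances older than the cutoff were removed by the last run; the rest stay forever.
    return [i for i in instances if i >= cutoff]


# Values taken from the sample feed: hourly frequency, 10-hour retention limit.
start = datetime(2015, 10, 30, 1, 0, tzinfo=timezone.utc)
end = datetime(2015, 10, 30, 10, 0, tzinfo=timezone.utc)

leftover = never_deleted(start, end, timedelta(hours=1), timedelta(hours=10))
print(len(leftover), leftover[0].isoformat(), leftover[-1].isoformat())
```

Under these assumptions the cutoff for the sample feed is 2015-10-30T00:00Z, which is before the validity start, so every instance of this feed survives. With a shorter limit such as `timedelta(hours=3)`, only the last three hourly instances would be left behind.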

If you expect all feed instances to be deleted once they pass the retention limit, Falcon would have to change the way it creates the retention coordinator job. I opened a JIRA for this: https://issues.apache.org/jira/browse/FALCON-1644

2 REPLIES

Hi Balu,

Thanks for your answer. My understanding of retention is that all instances older than the retention period would be deleted, regardless of whether the dataset is still valid. However, I am not saying that this is the only possible interpretation; the existing behavior may be useful in some use cases. If so, maybe you could introduce a new retention action, such as 'delete-always', to handle this situation. That way the change would be backward compatible and would not alter existing Falcon behavior.