
Use Falcon only to implement hdfs data retention in single cluster?

Contributor

We want to implement a retention policy for different HDFS directories and are evaluating Falcon to achieve this. In our case, we have arbitrary HDFS directory names without any standard naming convention, e.g. /data/incoming/customer1/, /data/outgoing/customer2/, /data/temp/process1-n/, etc.

Can we use Falcon in such a case to implement data retention? If yes, what are the high-level steps?

7 Replies

Re: Use Falcon only to implement hdfs data retention in single cluster?

@Rahul Reddy

One way I see to achieve this is to use the same cluster as both source and target.

Then schedule a feed, something like the one below:

<?xml version="1.0" encoding="UTF-8"?>
<feed description="Hourly Feed" name="testHourlyFeed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(60)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <clusters>
        <cluster name="source" type="source">
            <validity start="2016-06-30T00:00Z" end="2016-08-01T00:00Z"/>
            <retention limit="hours(1)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2016-06-01T00:00Z" end="2016-08-01T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
            <locations>
                <location type="data" path="/user/ambari-qa/data/repl-in/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/data/in/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="ambari-qa" group="hdfs" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
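
Once the "source" and "secondary" cluster entities are submitted, a feed like this can be submitted and scheduled with the Falcon CLI. A minimal sketch, assuming the XML above is saved as retention-feed.xml (the file name is just an example):

# Submit the feed entity (the cluster entities it references must already exist)
falcon entity -type feed -submit -file retention-feed.xml

# Schedule it so Falcon creates the retention (and replication) coordinators in Oozie
falcon entity -type feed -schedule -name testHourlyFeed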

Re: Use Falcon only to implement hdfs data retention in single cluster?

Contributor

Our concern is that we do not have a time-based HDFS directory structure like the one you mentioned in the example:

<location type="data" path="/user/ambari-qa/data/in/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
<location type="stats" path="/none"/>

Does Falcon have the ability to look at the file-creation timestamp of a file and apply the retention policy?

Re: Use Falcon only to implement hdfs data retention in single cluster?

@Rahul Reddy: Currently Falcon only supports eviction based on dated partitions.

Re: Use Falcon only to implement hdfs data retention in single cluster?

Contributor

@Sowmya Ramesh: could you confirm that we have to have a dated-partition format in the directory structure?

Re: Use Falcon only to implement hdfs data retention in single cluster?

@ScipioTheYounger

Yes, you need to have a dated directory structure.
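
To make that concrete, a sketch of the kind of layout Falcon's retention expects, reusing the /data/incoming/customer1 path from the question (the partition values are made up for illustration):

# Data would need to land in dated partitions, e.g.:
hdfs dfs -ls /data/incoming/customer1/2016/07/01/00/
hdfs dfs -ls /data/incoming/customer1/2016/07/01/01/

# so that the feed's data location can reference the date pattern:
# <location type="data" path="/data/incoming/customer1/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>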

Re: Use Falcon only to implement hdfs data retention in single cluster?

Contributor

I tested this myself with a single cluster. I only saw two retention workflows get created, without any replication workflow, so replication within a single cluster might have issues.

Re: Use Falcon only to implement hdfs data retention in single cluster?

@ScipioTheYounger

You need to check the Oozie coordinators.
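
For example, with the standard Oozie CLI (the Oozie endpoint below is a placeholder for your environment):

# List coordinator jobs; look for the retention and replication
# coordinators that Falcon created for the feed
oozie jobs -oozie http://<oozie-host>:11000/oozie -jobtype coordinator

# Then drill into a specific coordinator
oozie job -oozie http://<oozie-host>:11000/oozie -info <coordinator-id>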
