How to take daily snapshots of data through oozie


I have a coordinator running, and my first import succeeded, but as soon as it runs a second time it throws a "file already exists" exception.

How do I work around this?

1. This isn't an incremental import but a daily import of all data.

2. I do not want to delete it through <prepare> before I have a fresh copy of the import. (Obviously, I would not want to lose what I have before I know I have a fresh, updated copy of the data.)

3. I have --target-dir specified in the imports (just in case that makes any difference).

1 ACCEPTED SOLUTION


@simran kaur You should suffix your target-dir name with the date (for example, rather than naming the target-dir emp_snapshot, name it emp_snapshot_05132016). Then, on the next run, once you have an updated copy of the data, remove the old one. It should be straightforward. An additional benefit is that the name tells you when the last snapshot was completed and imported.
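
A minimal sketch of that naming-and-cleanup order, written in Python purely for illustration. The helper names (`snapshot_dir`, `stale_snapshots`) are hypothetical, the MMddyyyy suffix simply follows the emp_snapshot_05132016 example, and the directory listing is passed in as a plain list rather than read from HDFS. In a real coordinator the date would presumably come from the run's nominal time, and the delete would be a separate step that runs only after the import succeeds.

```python
from datetime import date

def snapshot_dir(base: str, run_date: date) -> str:
    """Build a dated target-dir name such as emp_snapshot_05132016 (MMddyyyy)."""
    return f"{base}_{run_date.strftime('%m%d%Y')}"

def stale_snapshots(existing: list[str], base: str, current: str) -> list[str]:
    """Return earlier snapshots of `base` that may be removed once `current` exists.

    Nothing is returned unless `current` is already in the listing, so the old
    copy is never dropped before the fresh import has landed.
    """
    if current not in existing:
        return []
    prefix = base + "_"
    return [
        d for d in existing
        if d.startswith(prefix)
        and len(d) == len(prefix) + 8      # exactly an 8-character date suffix
        and d[len(prefix):].isdigit()
        and d != current
    ]

# Hypothetical listing: yesterday's and today's snapshots plus an unrelated table.
today = snapshot_dir("emp_snapshot", date(2016, 5, 14))
listing = ["emp_snapshot_05132016", "emp_snapshot_05142016", "dept_snapshot_05132016"]
print(today)
print(stale_snapshots(listing, "emp_snapshot", today))
```

Because the cleanup list stays empty until the new directory shows up, a failed import leaves the previous day's copy untouched.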


3 REPLIES


Hi @simran kaur, this may or may not help depending on your exact scenario, but I've done something similar before using Falcon (which drives Oozie underneath).

Have a look at https://github.com/apache/falcon/tree/master/addons/hdfs-snapshot-mirroring

This is nice because it provides built-in functionality to:

* Create snapshots in source directory
* Copy this directory between HDFS clusters
* Create snapshot in target directory
* Handle snapshot retention in source and target directories

It's honestly going to be much easier than writing all of that yourself within Oozie. You don't need to use it to mirror those snapshots between clusters; you can use it within a single cluster. Hope that helps!
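
For anyone staying with plain Oozie rather than Falcon, the retention item in the list above would have to be handled by hand. A rough sketch, in Python for illustration only: `snapshots_to_purge` and its `keep` parameter are hypothetical, and the MMddyyyy suffix is borrowed from the date-suffix suggestion elsewhere in this thread, not from Falcon. One thing worth noting is that an MMddyyyy suffix does not sort chronologically as a string, so the sketch parses it into a date first.

```python
from datetime import datetime

def snapshots_to_purge(names: list[str], keep: int = 2) -> list[str]:
    """Return every snapshot except the `keep` most recent, oldest first."""
    def taken_on(name: str) -> datetime:
        # The text after the last underscore is assumed to be MMddyyyy.
        return datetime.strptime(name.rsplit("_", 1)[1], "%m%d%Y")

    ordered = sorted(names, key=taken_on)
    return ordered[:-keep] if keep > 0 else ordered

# Hypothetical names spanning a year boundary, where string order would be wrong.
names = ["emp_snapshot_01022017", "emp_snapshot_12302016", "emp_snapshot_12312016"]
print(snapshots_to_purge(names, keep=2))
```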

@simran kaur

Can you create a date-specific output directory using <prepare> and use that directory for storing your outputs?

Not sure if this will help in your case; just wanted to let you know.
