Created 05-13-2016 11:07 AM
I have a coordinator running, and my first import succeeded, but as soon as it runs a second time it throws a "file already exists" exception.
My question is: how do I work around this?
1. This isn't an incremental import but a daily import of all data.
2. I do not want to delete the existing data through <prepare> before I have a fresh copy of the import (obviously, because I would not want to lose what I have before I know I have a fresh, updated copy of the data).
3. I have --target-dir specified in the imports (just in case that makes any difference).
Created 05-13-2016 06:38 PM
@simran kaur You should suffix your target-dir name with the date (for example, rather than naming the target-dir emp_snapshot, name it emp_snapshot_05132016). Then, on the next run, once you have an updated copy of the data, remove the old one. It should be straightforward. An additional benefit is that the directory name tells you when the last snapshot was imported.
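To make the naming scheme concrete, here is a minimal Python sketch of how a date-suffixed target-dir could be built and passed to the import. It assumes the MMDDYYYY suffix used in the example above; the function names, the JDBC URL, the table name, and the /data path are hypothetical placeholders, not part of the original setup. Only the --connect, --table, and --target-dir options are taken from Sqoop itself.

```python
from datetime import date


def snapshot_dir(base: str, run_date: date) -> str:
    """Return the base directory name suffixed with the run date as MMDDYYYY."""
    return f"{base}_{run_date.strftime('%m%d%Y')}"


def sqoop_import_args(connect: str, table: str, target_dir: str) -> list:
    """Build the argument list for a full (non-incremental) Sqoop import.

    The connect string and table are supplied by the caller; the values
    used in the demo below are hypothetical.
    """
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
    ]


if __name__ == "__main__":
    # Each daily run writes to a new directory, so the previous copy
    # is never overwritten and no "already exists" error is raised.
    target = "/data/" + snapshot_dir("emp_snapshot", date(2016, 5, 13))
    print(" ".join(sqoop_import_args("jdbc:mysql://db.example.com/hr", "emp", target)))
```

Because every run gets its own directory, the previous day's copy stays in place until the new import has finished, and it can be deleted afterwards as a separate step.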
Created 05-13-2016 11:36 AM
Hi @simran kaur, this may or may not help depending on your exact scenario, but I've done something similar before by using Falcon (which drives Oozie underneath).
Have a look at https://github.com/apache/falcon/tree/master/addons/hdfs-snapshot-mirroring
The reason this is nice is that it provides built-in functionality to:
* Create snapshots in the source directory
* Copy this directory between HDFS clusters
* Create a snapshot in the target directory
* Handle snapshot retention in the source and target directories
It's honestly going to be much easier than writing all of that yourself within Oozie. You don't need to use it to mirror those snapshots between clusters; you can use it within a single cluster. Hope that helps!
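For comparison, this is roughly what a hand-written retention step might look like if you keep date-suffixed directories yourself instead of using Falcon. It is only a sketch and is not how Falcon implements retention: the function names are hypothetical, and it assumes directory names of the form base_MMDDYYYY. It decides which directories are old enough to delete and leaves the actual deletion to the caller.

```python
from datetime import datetime


def parse_snapshot_date(name: str, base: str):
    """Return the date encoded in 'base_MMDDYYYY', or None if the name does not match."""
    prefix = base + "_"
    if not name.startswith(prefix):
        return None
    try:
        return datetime.strptime(name[len(prefix):], "%m%d%Y").date()
    except ValueError:
        return None


def snapshots_to_delete(names, base: str, keep: int = 1):
    """Return the snapshot names older than the `keep` newest ones, newest first."""
    dated = [(parse_snapshot_date(n, base), n) for n in names]
    # Ignore anything that is not a well-formed snapshot of this base name.
    dated = [(d, n) for d, n in dated if d is not None]
    # Sort by the parsed date, not by the string: MMDDYYYY does not sort
    # chronologically as text (12312015 would sort after 05132016).
    dated.sort(reverse=True)
    return [n for _, n in dated[keep:]]


if __name__ == "__main__":
    existing = ["emp_snapshot_05132016", "emp_snapshot_05122016", "emp_snapshot_12312015"]
    print(snapshots_to_delete(existing, "emp_snapshot", keep=1))
```

A YYYYMMDD suffix would sort correctly as plain text and make this step simpler; the sketch keeps MMDDYYYY only to match the naming used elsewhere in this thread.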
Created 05-13-2016 06:27 PM
Can you create a date-specific output directory using <prepare> and use that directory for storing your outputs?
Not sure if this will help in your case; just wanted to let you know.