I am currently using cron and my driver code to merge sequence files for the previous hour. For example, if I have 6 files and 3 of them are below the block size, I merge the small files together to get as close to the block size as I can. Cron kicks off at 10 past the hour and passes in the previous hour's directory, e.g. /data//2014/01/14/09 if the current hour is 10. The job merges these files and replaces the small files with the merged ones.
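To make the setup concrete, here is a rough sketch of the logic in Python. It is a simplification for illustration, not my actual driver: the names `previous_hour_dir` and `plan_merges`, the 128 MB block size, and the first-fit grouping are all assumptions, and the base directory is passed in as a parameter.

```python
from datetime import datetime, timedelta

# Assumed HDFS block size for illustration; the real value depends on the cluster.
BLOCK_SIZE = 128 * 1024 * 1024


def previous_hour_dir(base, now):
    """Return base/YYYY/MM/DD/HH for the hour before `now` (hypothetical helper)."""
    prev = now - timedelta(hours=1)
    return f"{base.rstrip('/')}/{prev:%Y/%m/%d/%H}"


def plan_merges(sizes, block_size=BLOCK_SIZE):
    """Group files smaller than block_size into batches of at most block_size.

    `sizes` maps file name to size in bytes. Uses a simple first-fit pass,
    largest file first. Batches with a single file are dropped because
    there is nothing to merge them with.
    """
    small = sorted(
        ((name, size) for name, size in sizes.items() if size < block_size),
        key=lambda item: -item[1],
    )
    batches = []  # each entry is [total_bytes, [file names]]
    for name, size in small:
        for batch in batches:
            if batch[0] + size <= block_size:
                batch[0] += size
                batch[1].append(name)
                break
        else:
            batches.append([size, [name]])
    return [names for _, names in batches if len(names) > 1]
```

Cron effectively calls the equivalent of `previous_hour_dir(base, datetime.now())` at 10 past the hour, and the driver then merges each batch and swaps the merged file in for the originals.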
I have been reading the Oozie documentation, and I am having a hard time expressing this in Oozie. Do I need to create a dataset for the previous hour? Also, the output dataset is the same as the input, since the merged files replace the small files in place.
Any pointers much appreciated.