Not sure where to post this question, let me know if this is the wrong section.
I've an Oozie bundle with some coordinators inside which import data from various sources and generate hive tables with some transformations. This is scheduled once every day.
I need to design a rollback procedure that brings the cluster to the status of the previous day.
I was thinking to add these two operations before starting the daily import/transformation tasks:
Then when I need to rollback I can stop the current Oozie bundle, overwrite the hdfs Hive data with the data in the backup folder and restore the Hive Metastore database.
Thanks for any information
When I've had to support data rollback in Hive (and every other data warehouse project concerned with reliability), I've leveraged partitions to support logical rollback, vs physically moving data around or restoring the metastore. The process flow looked something like:
1. Create a table, partitioned by load date or id.
2. Insert data into that partition.
3. Create a view in an interface layer (another DB) that queries from the base table, and filters to that one partition. Query tools / clients only query from this interface layer.
4. Next load, add a new partition.
5. Insert data into the new partition.
6. Update the interface view to point to the new partition.
When rollback is needed, it's easy to update all the views to point to a previous partition that represents an earlier point-in-time.
With the views filtering by a partition key, Hive and Impala (but not Spark SQL) will use partition pruning to only select from the correct partitions. Another benefit of the above approach is you have point-in-time snapshots of data over time, which makes it very easy to build a data warehouse with snapshot facts. It can also be made to work with tables that are appended to (vs replaced or merged into), assuming you aren't updating data in place, e.g., you're using new partitions with each load.
The downside of the above approach is you can end up with lots of partitions, which can be a problem with Impala (it scales less well than Hive on partition count), but that's mitigated with a good retention policy or table archive implementation.
Thank you very much for the detailed answer @mzkdm.
This is indeed a very interesting point. Do you think could make sense to have daily-based partitions, since my main ingestion workflow run once a day? And how can I force Hive or Impala users to use the last point-in-time data?
Thanks for the help!