Support Questions

ludof · ‎02-27-2018

Not sure where to post this question, let me know if this is the wrong section.

I've an Oozie bundle with some coordinators inside which import data from various sources and generate hive tables with some transformations. This is scheduled once every day.

I need to design a rollback procedure that brings the cluster to the status of the previous day.

I was thinking to add these two operations before starting the daily import/transformation tasks:

Make a snapshot (backup) of the hdfs hive data in a backup folder
Make a backup of the Hive Metastore database (MySQL)

Then when I need to rollback I can stop the current Oozie bundle, overwrite the hdfs Hive data with the data in the backup folder and restore the Hive Metastore database.

My questions:

Is this going to work? Or are there critical problems that I am not seeing?
Which approach do you guys suggest to support rollback in a CLoudera environment?

Thanks for any information

mzkdm · ‎03-27-2018

Hi,

When I've had to support data rollback in Hive (and every other data warehouse project concerned with reliability), I've leveraged partitions to support logical rollback, vs physically moving data around or restoring the metastore. The process flow looked something like:

1. Create a table, partitioned by load date or id.

2. Insert data into that partition.

3. Create a view in an interface layer (another DB) that queries from the base table, and filters to that one partition. Query tools / clients only query from this interface layer.

4. Next load, add a new partition.

5. Insert data into the new partition.

6. Update the interface view to point to the new partition.

When rollback is needed, it's easy to update all the views to point to a previous partition that represents an earlier point-in-time.

With the views filtering by a partition key, Hive and Impala (but not Spark SQL) will use partition pruning to only select from the correct partitions. Another benefit of the above approach is you have point-in-time snapshots of data over time, which makes it very easy to build a data warehouse with snapshot facts. It can also be made to work with tables that are appended to (vs replaced or merged into), assuming you aren't updating data in place, e.g., you're using new partitions with each load.

The downside of the above approach is you can end up with lots of partitions, which can be a problem with Impala (it scales less well than Hive on partition count), but that's mitigated with a good retention policy or table archive implementation.

Regards,

Mark

ludof · ‎04-10-2018

Thank you very much for the detailed answer @mzkdm.

This is indeed a very interesting point. Do you think could make sense to have daily-based partitions, since my main ingestion workflow run once a day? And how can I force Hive or Impala users to use the last point-in-time data?

Thanks for the help!

Cloudera Community

Support Questions

Rollback