I use DistCp every X minutes to transfer HDFS data to a hot backup cluster. Should I replicate the whole Hive metastore database (manually or using DB HA) to support backup/restore, or do I only need to export/import some specific Hive metastore tables?
Thanks in advance.
I think it purely depends on your business needs, but the following may help.
I don't think a backup of the whole Hive metastore database is required every X minutes unless you have a specific use case. So you can do the following:
1. Take a whole hive metastore db backup once
2. Take the delta/impacted table backup on top of it every X minutes
That way, your backup metastore stays up to date.
I don't think there is an easy way to get the delta/impacted tables out of the 56 tables.
Here is a high-level idea that you can improve as needed:
1. Create a sample table.
2. Check the impact on the underlying HDFS folder whenever a DDL change is applied to it. Alternatively, if you can capture a date (or some other marker) from the DESCRIBE FORMATTED output before and after the DDL change, that is a good option.
3. Write a script and pass the 56 tables as parameters. The script simply runs DESCRIBE FORMATTED for each given table and identifies what changed compared to the output from X hours earlier.
4. Run the above script before you take a backup every X hours.
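As a rough sketch of steps 3 and 4, something like the following Python script could work. It assumes the hive CLI is on the PATH and that the DESCRIBE FORMATTED output contains a transient_lastDdlTime entry to use as the change marker. The snapshot file name ddl_snapshot.json and the function names are made up for illustration.

```python
import json
import re
import subprocess
import sys
from pathlib import Path


def describe_formatted(table):
    """Return raw DESCRIBE FORMATTED output (assumes the hive CLI is on PATH)."""
    result = subprocess.run(
        ["hive", "-S", "-e", f"DESCRIBE FORMATTED {table}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def last_ddl_time(describe_output):
    """Extract transient_lastDdlTime (epoch seconds), or None if it is absent."""
    match = re.search(r"transient_lastDdlTime\s+(\d+)", describe_output)
    return int(match.group(1)) if match else None


def changed_tables(current, previous):
    """Return tables whose marker is new or differs from the previous snapshot."""
    return sorted(t for t, stamp in current.items() if previous.get(t) != stamp)


def main(tables, snapshot_path="ddl_snapshot.json"):  # hypothetical file name
    snapshot = Path(snapshot_path)
    previous = json.loads(snapshot.read_text()) if snapshot.exists() else {}
    current = {t: last_ddl_time(describe_formatted(t)) for t in tables}
    for table in changed_tables(current, previous):
        print(table)  # candidates for the next delta backup
    snapshot.write_text(json.dumps(current))  # becomes "previous" on the next run


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Run it with the table names as arguments just before each backup. On the first run every table is reported, because there is no earlier snapshot to compare against.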
But I think there is a misunderstanding: by "the 56 tables" I mean the tables of the metastore (Postgres) database itself! My backup approach is based on 2 steps:
1- Back up the updated HDFS tables with DistCp (done).
2- Back up the metastore (Postgres) database (current question).
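For step 2, one possible sketch is to wrap pg_dump in a small Python script, either for the whole metastore database or for selected tables via -t. The database name "metastore", the user "hive", the host, and the function names are assumptions for illustration and need to be adapted to the actual setup.

```python
import subprocess
from datetime import datetime, timezone


def build_pg_dump_command(database, outfile, host="localhost", user="hive", tables=()):
    """Build a pg_dump command line; database, host and user are assumed values."""
    cmd = ["pg_dump", "-h", host, "-U", user, "-d", database,
           "-F", "c",       # custom archive format, restorable with pg_restore
           "-f", outfile]
    for table in tables:    # no tables given means the whole database is dumped
        cmd += ["-t", table]
    return cmd


def backup_metastore(database="metastore", **kwargs):
    """Dump the metastore to a timestamped file and return the file name."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    outfile = f"{database}_{stamp}.dump"
    subprocess.run(build_pg_dump_command(database, outfile, **kwargs), check=True)
    return outfile
```

Calling backup_metastore() right after each DistCp run would produce one dump per cycle. A single pg_dump run takes a consistent snapshot of everything it dumps, so dumping the whole database avoids having to work out which of the 56 tables reference each other.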