Created 09-29-2015 12:51 AM
Need best practices for backup and DR for:
- Hive Metastore DB (e.g., MySQL, Postgres)
- Hive data
Created 09-29-2015 09:33 AM
For Hive on Oracle, Data Guard can be used as a DR solution. Refer to: Oracle Data Guard - Transparent Application Failover
Created 09-29-2015 10:36 AM
For the Hive Metastore on MySQL, you can configure the Hive Metastore Service for HA on multiple boxes; MySQL itself also needs to be configured for active-active replication. More info at High Availability for Hive Metastore.
Backup/restore for the Hive Metastore is covered in 5.1.7. Perform Backups. The backup method we normally use is "mysqldump hive > /tmp/mydir/backup_hive.sql". Note that there are various ways of backing up MySQL databases; the important part is to back up the Hive database schema. For a full DR solution for MySQL you also need to back up the MySQL config files, etc. For a description of MySQL backup/restore, see http://dev.mysql.com/doc/mysql-backup-excerpt/5.7/en/index.html.
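A minimal sketch of the dump/restore cycle, assuming the metastore database is named "hive" and a MySQL user with access to it exists (names and paths are placeholders; adjust for your environment):

```shell
# Dump the Hive metastore database to a SQL file.
# --single-transaction gives a consistent dump for InnoDB tables
# without locking the metastore while Hive is in use.
mysqldump --single-transaction -u hive -p hive > /tmp/mydir/backup_hive.sql

# To restore into an empty "hive" database on the DR side:
# mysql -u hive -p hive < /tmp/mydir/backup_hive.sql
```

Schedule the dump (e.g., via cron) and copy the resulting file off the metastore host so a loss of that machine does not also lose the backup.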
Created 09-29-2015 11:38 AM
Above two answers are great for Hive Metastore backup. Now for the Hive data itself, here are a few options.
Option 1) Hive data is stored in HDFS (Hadoop Distributed File System), so any backup or DR (Disaster Recovery) strategy you have for HDFS can be used for Hive as well. For example, you can use the HDFS snapshot feature to take a point-in-time image. A snapshot can cover the entire file system, a subtree, or a single file. You can also capture incremental changes by diffing two snapshots.
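The snapshot workflow above looks roughly like this (the warehouse path is the common default; adjust for your cluster):

```shell
# Enable snapshots on the directory backing the Hive warehouse
# (must be run as an HDFS administrator).
hdfs dfsadmin -allowSnapshot /apps/hive/warehouse

# Take a named point-in-time snapshot.
hdfs dfs -createSnapshot /apps/hive/warehouse s_2015_09_29

# Later, report what changed between two snapshots (the incremental view).
hdfs snapshotDiff /apps/hive/warehouse s_2015_09_29 s_2015_09_30
```

Snapshots are read-only and live under `<path>/.snapshot/<name>`, so restoring a file is just a copy back out of that directory.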
Option 2) You can write your own DistCp job and make it part of a Falcon data pipeline.
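A hedged sketch of such a DistCp invocation; the NameNode hostnames are placeholders, and in practice this command would be wrapped in the Falcon process definition rather than run by hand:

```shell
# Copy the warehouse directory to the DR cluster.
# -update copies only files that are new or changed on the source;
# -p (with no argument) preserves replication, block size, user,
# group, permission, and checksum type.
hadoop distcp -update -p \
  hdfs://nn-primary:8020/apps/hive/warehouse \
  hdfs://nn-dr:8020/apps/hive/warehouse
```

Pairing `-update` with a regular schedule gives you incremental replication without recopying the whole warehouse each run.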
Option 3) You can use the Falcon data mirroring capability to mirror data in HDFS or Hive tables.
Option 4) You can run an active-active data load into both your primary cluster and your DR cluster. For example, if you are using a Sqoop job to pull data from a particular RDBMS and load it into a Hive table, you can create two Sqoop jobs: one to load the primary cluster's Hive table and one to load the DR cluster's Hive table.
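The active-active load might be sketched as below; the connection string, database, table, and username are all hypothetical placeholders:

```shell
# Import the "orders" table from the source RDBMS into a Hive table
# on the primary cluster. -P prompts for the password interactively.
sqoop import \
  --connect jdbc:mysql://rdbms-host/sales \
  --table orders \
  --hive-import --hive-table orders \
  --username etl -P

# Run the identical job a second time with the client configuration
# pointing at the DR cluster (e.g., a HADOOP_CONF_DIR containing the
# DR cluster's configs) so the same data lands in the DR Hive table.
```

The trade-off versus mirroring is that both clusters are loaded independently, so a failure of one load must be detected and retried or the clusters drift apart.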
Your choice of option depends on the SLAs (service level agreements) around DR/backup, budget, skill level, etc.
Created 07-03-2016 03:09 AM
@Chakra You may also have Hive data declared as an external table, in which case the data sits in a file store outside of HDFS. In that case, as long as you back up your Hive metastore, you should be good, assuming the external file store has its own backup and restore policies.