Support Questions
Find answers, ask questions, and share your expertise

What are best practices for setting up Backup and Disaster Recovery for Hive in production?

Solved


Need best practices for backup and DR for:

- the Hive Metastore DB (MySQL, Postgres, etc.)

- Hive data

1 ACCEPTED SOLUTION

Accepted Solutions

Re: What are best practices for setting up Backup and Disaster Recovery for Hive in production?

New Contributor

The two answers above cover Hive Metastore backup well. For the Hive data itself, here are a few options:

Option 1) Hive data is stored in HDFS (Hadoop Distributed File System), so any backup or DR (disaster recovery) strategy you have for HDFS can be used for Hive as well. In particular, you can use the HDFS snapshot feature to take a point-in-time image. A snapshot can cover the entire file system, a sub-tree of the file system, or just a file. You can also do incremental backups by diffing two snapshots and copying only what changed.
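As a rough sketch, the snapshot workflow looks like this (the warehouse path and snapshot names below are placeholders for your own):

```bash
# Enable snapshots on the Hive warehouse directory (HDFS admin command)
hdfs dfsadmin -allowSnapshot /apps/hive/warehouse

# Take a point-in-time snapshot
hdfs dfs -createSnapshot /apps/hive/warehouse snap-day1

# Later, take another snapshot and diff the two to see what changed
hdfs dfs -createSnapshot /apps/hive/warehouse snap-day2
hdfs snapshotDiff /apps/hive/warehouse snap-day1 snap-day2
```

The diff output lists created, deleted, and modified paths between the two snapshots, which is what you feed an incremental copy.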

Option 2) You can write your own DistCp job and make it part of a Falcon data pipeline.

Using Distcp to copy files
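A minimal DistCp invocation for cross-cluster replication might look like the following (NameNode hostnames and paths are placeholders):

```bash
# Copy the Hive warehouse from the primary cluster to the DR cluster.
# -update copies only files that changed; -delete removes files on the
# target that no longer exist on the source.
hadoop distcp -update -delete \
  hdfs://primary-nn:8020/apps/hive/warehouse \
  hdfs://dr-nn:8020/apps/hive/warehouse
```

In practice you would schedule this (e.g. via Falcon or Oozie) rather than run it by hand.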

Option 3) You can use Falcon's data mirroring capability to mirror data in HDFS or Hive tables.

Here is a link on that

Falcon Data Mirroring

Option 4) You can run an active-active data load into both your primary cluster and your DR cluster. For example, if you are using a Sqoop job to pull data from a particular RDBMS and load it into a Hive table, you can create two Sqoop jobs: one to load the primary cluster's Hive table and another to load the DR cluster's Hive table.
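As a sketch, the two jobs are the same import pointed at different clusters (the JDBC URL, credentials, table, and database names are placeholders):

```bash
# Job 1: load the primary cluster's Hive table
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.db_pass \
  --table orders \
  --hive-import --hive-table sales.orders

# Job 2: the identical import, run from a gateway node that is
# configured against the DR cluster's HDFS and metastore
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.db_pass \
  --table orders \
  --hive-import --hive-table sales.orders
```

The trade-off is double the load on the source RDBMS in exchange for a DR cluster that is always current.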

Your choice of option depends on the SLAs (service-level agreements) around DR/backup, budget, skill level, etc.

4 REPLIES

Re: What are best practices for setting up Backup and Disaster Recovery for Hive in production?

For Hive on Oracle, Data Guard can be used as the DR solution. Refer to: Oracle Dataguard - Transparent Application Failover

Re: What are best practices for setting up Backup and Disaster Recovery for Hive in production?

Expert Contributor

For the Hive Metastore on MySQL: you can configure the Hive Metastore service for HA on multiple boxes, and MySQL also needs to be configured for active-active replication. More info at High Availability for Hive Metastore.
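With multiple metastore instances, clients fail over between the Thrift URIs listed in hive-site.xml. A sketch of that setting (the hostnames are placeholders):

```xml
<!-- hive-site.xml: clients try the listed metastore instances in turn -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>
```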

Backup/restore for the Hive Metastore is covered in 5.1.7. Perform Backups. The backup method we normally use is "mysqldump hive > /tmp/mydir/backup_hive.sql". Note that there are various ways of backing up MySQL databases; the important part is to back up the Hive database schema. For a full DR solution for MySQL you also need to back up the MySQL config files, etc. For a description of MySQL backup/restore, see http://dev.mysql.com/doc/mysql-backup-excerpt/5.7/en/index.html.
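A slightly fuller version of that dump-and-restore cycle might look like this (assuming the metastore database is named `hive`; credentials and paths are placeholders):

```bash
# Back up the Hive metastore database (schema + data)
mysqldump -u root -p hive > /tmp/mydir/backup_hive.sql

# Restore on the standby server: recreate the database, then load the dump
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS hive"
mysql -u root -p hive < /tmp/mydir/backup_hive.sql
```

Schedule the dump (e.g. via cron) and copy the file off-host so a metastore server failure does not take the backup with it.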


Re: What are best practices for setting up Backup and Disaster Recovery for Hive in production?

New Contributor

@Chakra You may also have Hive data declared as an external table, in which case the data sits in a file store outside of HDFS. In that case, as long as you back up your Hive metastore you should be fine, assuming the external file store has its own backup and restore policies.
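To illustrate, an external table only records metadata in the metastore; the files at LOCATION belong to (and are backed up by) the external store. The JDBC URL, schema, columns, and bucket below are all placeholders:

```bash
# Define an external table over data living in S3 via beeline
beeline -u jdbc:hive2://hiveserver:10000 -e "
CREATE EXTERNAL TABLE sales.orders_ext (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC
LOCATION 's3a://my-backup-bucket/warehouse/orders';
"
```

Dropping an external table removes only the metastore entry, not the underlying files, which is why the metastore dump plus the external store's own backups is sufficient.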
