New Contributor
Posts: 1
Registered: ‎12-08-2017

BDR - Hive replication to a cluster with existing databases

I'm preparing to enable BDR Hive Replication from Cluster1 (source) to Cluster2 (destination).

In the BDR configuration I don't see any setting that would let me store databases in the destination cluster under different names than they have in the source.

Does that mean that, for instance, the 'default' database in Cluster2 will be overwritten after running Hive Replication?

My goal is to have a backup of Cluster1 Hive & Impala databases on Cluster2.
However, I do have a number of Cluster2 databases with the same names as in Cluster1 and I don't want to delete them.



Posts: 866
Topics: 1
Kudos: 200
Solutions: 107
Registered: ‎04-22-2014

Re: BDR - Hive replication to a cluster with existing databases

@MichalAR,

 

Before I try to answer, we need to clarify some things here.

 

Hive Replication replicates:

 

- Hive metadata (databases, tables, partitions)

 

If you choose to also replicate data for the Hive metadata, then the following is replicated:

 

- HDFS data

 

The HDFS data copied is based on the "LOCATION" of data specified in the Hive metadata.

 

----------------------------------

To answer your questions more directly:

 

(1)

 

Yes, BDR allows you to specify an alternate "root" for your replicated Hive data files.  By default it is "/".  In the Advanced Tab of the Replication Schedule, you can alter this root by setting "HDFS Destination Path" to your desired directory.

 

From the docs here:

https://www.cloudera.com/documentation/enterprise/latest/topics/cm_bdr_hive_replication.html#concept...

 

we learn:

 

By default, Hive HDFS data files (for example, /user/hive/warehouse/db1/t1) are replicated to a location relative to "/" (in this example, to /user/hive/warehouse/db1/t1). To override the default, enter a path in the HDFS Destination Path field. For example, if you enter /ReplicatedData, the data files would be replicated to /ReplicatedData/user/hive/warehouse/db1/t1.
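To make the path mapping from the docs concrete, here is a minimal Python sketch of the rule as I read it: the source path is appended to whatever you enter as HDFS Destination Path. The function name `replicated_path` is made up for illustration and is not part of BDR or Cloudera Manager; it only mirrors the example in the quoted documentation.

```python
import posixpath


def replicated_path(source_path: str, destination_root: str = "/") -> str:
    """Return where a source HDFS path would land under the destination root.

    Illustrative only: this mirrors the documented example and is not BDR code.
    """
    # Drop the leading slash so the source path is joined relative to the root.
    relative = source_path.lstrip("/")
    return posixpath.normpath(posixpath.join(destination_root, relative))


# Default root "/": the file keeps the same path on the destination.
print(replicated_path("/user/hive/warehouse/db1/t1"))

# Custom root: the whole source path is nested under /ReplicatedData.
print(replicated_path("/user/hive/warehouse/db1/t1", "/ReplicatedData"))
```

Note that this only changes where the data files land. It does not rename the database or its tables.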

 

(2)

"default" is the name of a database, so where the underlying data is stored is not really relevant to the metadata.  What I mean is that your choice of HDFS Destination Path will not govern what gets "overwritten" at the Hive metadata level, where "default" is simply the name of the database.

 

That said, replication will not overwrite incompatible metadata in the destination Hive instance by default.

You can override this behavior and make sure the destination is an exact copy of the source's metadata by enabling Force Overwrite in your Hive Replication Configuration.  See the documentation page mentioned above for more info.
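Before deciding on Force Overwrite, it may help to know exactly which database names exist on both clusters. A rough Python sketch of that check follows. The database lists are hypothetical placeholders (you would paste in the output of SHOW DATABASES from each cluster), and `colliding_databases` is a name I made up, not a BDR feature.

```python
def colliding_databases(source_dbs, destination_dbs):
    """Return the database names present on both clusters, sorted."""
    # Hive database names are case-insensitive, so compare in lower case.
    source = {name.lower() for name in source_dbs}
    destination = {name.lower() for name in destination_dbs}
    return sorted(source & destination)


# Hypothetical SHOW DATABASES output from each cluster.
cluster1_dbs = ["default", "sales", "logs"]
cluster2_dbs = ["default", "sales", "staging"]

print(colliding_databases(cluster1_dbs, cluster2_dbs))
```

Any name in that list is a database whose metadata on Cluster2 would be affected by replicating the same-named database from Cluster1.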

 

Based on what you say, I don't think you can accomplish your goal without some alternative action.

 

- Perhaps you can create a second Hive Service with a custom warehouse directory that you can use to replicate to.  The replication configuration should let you choose your destination service.  I've never tested that, but it sounds reasonable.

- Another idea is to set up a separate cluster that will act only as your backup.
