Support Questions

bhoomireddy_vij · ‎06-22-2016

Hi,

I am looking for views on which is the better strategy for replicating Hive and its metadata for DR purposes between two Kerberized, highly secured clusters:

Copy underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled

vs

Using Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata)

Are there any caveats I need to be aware of with either of these approaches, and which one is preferred, specifically with HDP 2.4.2 in mind?

Thanks

Vijay

pminovic · ‎06-22-2016

Falcon mirroring is better than distcp because Hive metadata and data files are mirrored together. With distcp, you would have to mirror the Hive metadata separately.

There are some caveats. All Hive databases and tables that exist before Falcon mirroring starts have to be mirrored separately, for example with Hive's EXPORT/IMPORT TABLE commands. It is also better to mirror whole databases rather than individual tables: if you mirror a whole database, any table newly created on the source cluster is automatically mirrored (created) on the DR cluster. However, some operations, such as ACID delete and update, are not supported.

The mechanism is also more complicated: Falcon schedules Oozie jobs, and those jobs do the mirroring. It's better to run them on the DR cluster to spare resources on the source cluster. Finally, if your clusters use NameNode HA, you will have to configure HDFS on the DR cluster to be aware of the NameNode HA setup on the source cluster (this also holds for the distcp approach). You can find more details in the Data Governance Guide, sections 2 and 4.
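To make the NameNode HA point a little more concrete, here is a minimal sketch of the kind of client-side properties the DR cluster's hdfs-site.xml would typically need in order to resolve the source cluster's HA nameservice. The nameservice names, NameNode IDs, and host names are placeholders I made up for illustration, and the helper functions are hypothetical; please check the exact property set against the documentation for your Hadoop version before relying on it.

```python
from xml.sax.saxutils import escape


def remote_ha_properties(local_ns, remote_ns, remote_namenodes):
    """Build hdfs-site.xml properties that let one cluster resolve
    another cluster's HA nameservice.

    remote_namenodes maps a NameNode ID (e.g. "nn1") to its RPC
    address in "host:port" form. All names here are illustrative.
    """
    props = {
        # Clients on the DR cluster need to know both nameservices.
        "dfs.nameservices": f"{local_ns},{remote_ns}",
        # DataNodes should only register with the local nameservice.
        "dfs.internal.nameservices": local_ns,
        "dfs.ha.namenodes." + remote_ns: ",".join(remote_namenodes),
        "dfs.client.failover.proxy.provider." + remote_ns:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in remote_namenodes.items():
        props[f"dfs.namenode.rpc-address.{remote_ns}.{nn_id}"] = address
    return props


def to_xml(props):
    """Render the properties as <property> elements for hdfs-site.xml."""
    return "\n".join(
        f"<property><name>{escape(name)}</name>"
        f"<value>{escape(value)}</value></property>"
        for name, value in props.items()
    )


if __name__ == "__main__":
    # Placeholder nameservices and hosts, not taken from any real cluster.
    example = remote_ha_properties(
        "drcluster",
        "prodcluster",
        {"nn1": "prod-nn1.example.com:8020",
         "nn2": "prod-nn2.example.com:8020"},
    )
    print(to_xml(example))
```

With something like this in place, jobs on the DR cluster can refer to the source as hdfs://prodcluster/... without hard-coding whichever NameNode is currently active.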

bhoomireddy_vij · ‎06-22-2016

@Predrag Monodic Thanks for your response. Some follow-up questions:

"With distcp, you would have to mirror Hive metadata separately."

If a distcp-based mechanism is chosen, is replicating the metadata simply a matter of copying the underlying metastore DB, or are there further considerations?

"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately"

We are building a DR cluster for an existing production cluster that uses Hive extensively. Does that mean the initial migration, when the DR cluster is first brought up, needs to be done outside Falcon using the Hive export/import mechanism?

"However, some operations like ACID delete and update are not supported."

Can you please shed more light on this? When and why are these not supported when Falcon is the chosen solution?

"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster."

Are you saying that Falcon (and consequently the underlying Oozie jobs) should run mainly on the DR cluster rather than on the source cluster?

pminovic · ‎06-23-2016

Let me answer your follow-up questions:

1. Metadata is not a simple copy; there are some per-cluster settings that would have to be taken into account.
2. Yes, Hive mirroring has to be "bootstrapped" using the Hive export/import feature, outside Falcon.
3. ACID usage is not widespread yet, and Falcon mirroring is "set-it-and-forget-it". Also, new features are coming in new versions of Falcon, and ACID will be supported before long.
4. Falcon and Oozie jobs can be run on either cluster. I prefer to run them on the DR cluster, which is not doing much anyway, instead of using the busy production (source) cluster.
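
For the bootstrap step, a rough sketch of how the one-time export, copy, and import could be planned is shown below. It only generates the HiveQL statements and the distcp command line; it does not run anything. The function name, database, table, staging directory, and filesystem URIs are all hypothetical placeholders, and I am assuming a plain `hadoop distcp -update` between the two clusters is acceptable in your environment (Kerberos cross-realm setup is not covered here).

```python
import shlex


def bootstrap_plan(database, tables, staging_dir, source_fs, target_fs):
    """Plan a one-time bootstrap of existing Hive tables.

    Returns (export_hql, distcp_cmd, import_hql):
      export_hql  - statements to run in Hive on the source cluster
      distcp_cmd  - command that copies the exported dumps to the DR cluster
      import_hql  - statements to run in Hive on the DR cluster
    """
    export_hql = [f"USE {database};"]
    import_hql = [
        # The database must exist on the DR side before importing.
        f"CREATE DATABASE IF NOT EXISTS {database};",
        f"USE {database};",
    ]
    for table in tables:
        dump_path = f"{staging_dir}/{database}/{table}"
        export_hql.append(f"EXPORT TABLE {table} TO '{dump_path}';")
        import_hql.append(f"IMPORT TABLE {table} FROM '{dump_path}';")

    # -update copies the contents of the source directory into the target
    # directory, so the layout is the same on both sides.
    distcp_cmd = shlex.join([
        "hadoop", "distcp", "-update",
        f"{source_fs}{staging_dir}/{database}",
        f"{target_fs}{staging_dir}/{database}",
    ])
    return export_hql, distcp_cmd, import_hql


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    exports, copy_cmd, imports = bootstrap_plan(
        "sales", ["orders", "customers"],
        "/apps/hive/bootstrap", "hdfs://prodcluster", "hdfs://drcluster",
    )
    print("\n".join(exports))
    print(copy_cmd)
    print("\n".join(imports))
```

Once the existing tables are in place on the DR cluster, the Falcon mirroring job can take over for subsequent changes.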

bhoomireddy_vij · ‎06-23-2016

@Predrag Monodic Thanks for your responses.

Cloudera Community

Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication
