Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication

Hi,

I am looking for views on which is the better strategy for replicating Hive and its metadata for DR purposes between two Kerberized, highly secured clusters.

  • Copy underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled

vs

  • Using Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata)

Are there any caveats I need to be aware of with either approach, and which one is preferred, specifically with HDP 2.4.2 in mind?

Thanks

Vijay

Accepted Solutions

Re: Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication

Falcon mirroring is preferable to distcp because it mirrors Hive metadata and data files together; with distcp, you would have to mirror the Hive metadata separately. There are some caveats. Hive databases and tables that exist before Falcon mirroring starts have to be mirrored separately, for example with Hive's export/import. It is also better to mirror whole databases rather than individual tables: when a whole database is mirrored, any table newly created on the source cluster is automatically created on the DR cluster. Some operations, such as ACID delete and update, are not supported. The mechanism is also more complicated: Falcon schedules Oozie jobs, which perform the mirroring. It is better to run those jobs on the DR cluster to spare resources on the source cluster. Finally, if your clusters use NameNode HA, you will have to configure HDFS on the DR cluster to be aware of the NameNode HA setup on the source cluster (this also holds for the distcp approach). You can find more details in the Data Governance Guide, sections 2 and 4.
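To make the NameNode HA point a little more concrete, here is a minimal Python sketch that generates the kind of hdfs-site.xml properties a client on the DR cluster would typically need in order to resolve the source cluster's HA nameservice. The nameservice names, NameNode IDs, and host names are made-up placeholders, and the helper functions are hypothetical; treat this as an illustration under those assumptions, not the exact procedure from the Data Governance Guide.

```python
from xml.sax.saxutils import escape


def remote_ha_properties(local_ns, remote_ns, remote_namenodes):
    """Build HDFS client properties describing a remote HA nameservice.

    remote_namenodes maps a NameNode ID (for example "nn1") to its
    RPC address in host:port form. All values here are placeholders.
    """
    props = {
        # List both nameservices so the client can resolve either one.
        "dfs.nameservices": f"{local_ns},{remote_ns}",
        # NameNode IDs that make up the remote nameservice.
        "dfs.ha.namenodes." + remote_ns: ",".join(remote_namenodes),
        # Standard failover proxy provider class shipped with HDFS.
        "dfs.client.failover.proxy.provider." + remote_ns:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in remote_namenodes.items():
        props[f"dfs.namenode.rpc-address.{remote_ns}.{nn_id}"] = address
    return props


def to_xml(props):
    """Render the properties as hdfs-site.xml <property> elements."""
    parts = []
    for name, value in props.items():
        parts.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(value)}</value>\n"
            "</property>"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical names: "drcluster" is local, "prodcluster" is the source.
    props = remote_ha_properties(
        "drcluster",
        "prodcluster",
        {
            "nn1": "prod-nn1.example.com:8020",
            "nn2": "prod-nn2.example.com:8020",
        },
    )
    print(to_xml(props))
```

Whether these properties belong in the cluster-wide hdfs-site.xml or in a client-side configuration used only by the copy jobs depends on your setup, so check the HDFS HA documentation for your Hadoop version before applying them.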

Re: Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication

@Predrag Monodic. Thanks for your response. Some follow-up questions:

"With distcp, you would have to mirror Hive metadata separately."

If the distcp-based mechanism is chosen, is handling the metadata simply a matter of copying over the underlying metastore database, or are there further considerations?

"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately"

We are building a DR cluster for an existing production cluster that uses Hive extensively. Does that mean the initial migration, once the DR cluster is up, needs to be done outside Falcon using the Hive export/import mechanism?

"However, some operations like ACID delete and update are not supported."

Could you shed more light on this? When and why are these not supported if Falcon is the chosen solution?

"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster."

Are you suggesting that Falcon (and consequently the underlying Oozie) should run mainly on the DR cluster rather than on the source cluster?

Re: Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication

Let me answer your follow-up questions:

  • Metadata is not a simple copy: there are some per-cluster settings that would have to be taken into account.
  • Yes, Hive mirroring has to be "bootstrapped" using the Hive export/import feature, externally to Falcon.
  • ACID usage is not widespread yet, and Falcon mirroring is meant to be "set it and forget it". New features are coming in new versions of Falcon, and ACID will be supported before long.
  • Falcon and Oozie jobs can be run on either cluster. I prefer to run them on the DR cluster, which is not doing much anyway, instead of on the busy production (source) cluster.
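As a rough illustration of the bootstrap step mentioned above, the following Python sketch generates Hive EXPORT and IMPORT statements for the tables of one database. The database name, table names, staging directory, and the helper function are hypothetical. It also assumes the exported directories are copied to the same path on the DR cluster (for example with distcp) before the import statements are run there.

```python
def bootstrap_statements(database, tables, staging_dir):
    """Return (export_sql, import_sql) statement lists for a one-off bootstrap.

    export_sql is meant to run on the source cluster and import_sql on the
    DR cluster, after the staging directory has been copied across.
    The layout <staging_dir>/<database>/<table> is an assumption made
    for this sketch, not something Hive or Falcon requires.
    """
    base = staging_dir.rstrip("/")
    export_sql = [f"USE {database};"]
    import_sql = [f"USE {database};"]
    for table in tables:
        path = f"{base}/{database}/{table}"
        # EXPORT writes both the table data and its metadata to the path.
        export_sql.append(f"EXPORT TABLE {table} TO '{path}';")
        # IMPORT recreates the table from the exported data and metadata.
        import_sql.append(f"IMPORT TABLE {table} FROM '{path}';")
    return export_sql, import_sql


if __name__ == "__main__":
    # Placeholder names for illustration only.
    exports, imports = bootstrap_statements(
        "sales", ["orders", "customers"], "/tmp/hive_export"
    )
    print("-- run on the source cluster")
    print("\n".join(exports))
    print("-- run on the DR cluster, after copying /tmp/hive_export")
    print("\n".join(imports))
```

The target database is assumed to exist already on the DR cluster. Once the existing tables are in place, Falcon mirroring can take over for subsequent changes.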

Re: Hive replication between clusters - Falcon based Hive replication vs distcp2 + database replication

@Predrag Monodic Thanks for your responses.
