Created 06-22-2016 09:07 AM
Hi,
I am looking for views on which is the better strategy for replicating Hive and its metadata for DR purposes between two Kerberized, highly secured clusters.
vs
Are there any caveats I need to be aware of with either of these approaches, and which one is preferred, specifically with HDP 2.4.2 in mind?
Thanks
Vijay
Created 06-22-2016 10:24 AM
Falcon mirroring is better than distcp because both Hive metadata and data files are mirrored together. With distcp, you would have to mirror Hive metadata separately. A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately, for example using Hive's EXPORT/IMPORT TABLE commands. Also, it's better to mirror whole databases, not individual tables. If you mirror a whole database, then any newly created table on the source cluster will be automatically mirrored (created) on the DR cluster. However, some operations like ACID delete and update are not supported. Also, the mechanism is more complicated: Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster. And finally, if your clusters use NameNode HA, you will have to configure HDFS on the DR cluster to be aware of the NameNode HA setup on the source cluster (this also holds for the distcp approach). You can find more details in the Data Governance Guide, sections 2 and 4.
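To make the NameNode HA point a bit more concrete, here is a rough sketch (not taken from the guide) of the kind of hdfs-site.xml properties the DR cluster typically needs so that its clients can resolve the source cluster's HA nameservice. The nameservice names, NameNode ids and hostnames are made up for illustration, and the helper function itself is hypothetical; check the exact property list against the documentation for your HDP version.

```python
# Sketch only: builds the hdfs-site.xml properties that let the DR cluster
# resolve the source cluster's HA nameservice. The nameservice names,
# NameNode ids and hosts used below are hypothetical placeholders.

def remote_nameservice_properties(local_ns, remote_ns, remote_namenodes):
    """Return hdfs-site.xml properties that make `remote_ns` resolvable.

    remote_namenodes maps a NameNode id (e.g. "nn1") to its "host:port"
    RPC address on the source cluster.
    """
    props = {
        # Clients on the DR cluster must know about both nameservices.
        "dfs.nameservices": f"{local_ns},{remote_ns}",
        # DataNodes on the DR cluster should report only to the local one.
        "dfs.internal.nameservices": local_ns,
        # The NameNode ids that make up the remote HA pair.
        "dfs.ha.namenodes." + remote_ns: ",".join(remote_namenodes),
        # Standard client-side failover provider for HA nameservices.
        "dfs.client.failover.proxy.provider." + remote_ns:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    # One RPC address entry per remote NameNode.
    for nn_id, rpc_address in remote_namenodes.items():
        props[f"dfs.namenode.rpc-address.{remote_ns}.{nn_id}"] = rpc_address
    return props


props = remote_nameservice_properties(
    "drcluster",
    "prodcluster",
    {"nn1": "prod-nn1.example.com:8020", "nn2": "prod-nn2.example.com:8020"},
)
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

The printed pairs are what you would add to hdfs-site.xml on the DR side (through Ambari or by hand), after which paths such as hdfs://prodcluster/... should resolve from the DR cluster.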
Created 06-22-2016 11:07 PM
@Predrag Monodic Thanks for your response. Some follow-up questions:
"With distcp, you would have to mirror Hive metadata separately."
If a distcp-based mechanism is chosen, is mirroring the metadata simply a matter of copying the underlying metastore database, or are there further considerations?
"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately"
We are building a DR cluster for an existing production cluster that uses Hive extensively. Does that mean the initial migration, when the DR cluster is first brought up, needs to be done outside Falcon using Hive's export/import mechanism?
"However, some operations like ACID delete and update are not supported."
Can you please shed more light on this? When and why are these operations not supported when Falcon is the chosen solution?
"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster."
Are you suggesting running Falcon (and consequently the underlying Oozie jobs) mainly on the DR cluster rather than on the source cluster?
Created 06-23-2016 07:56 AM
Let me answer your follow-up questions:
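On the initial migration of tables that exist before mirroring starts, the export/import route mentioned earlier could look roughly like the sketch below. It only generates the commands for one table: a Hive EXPORT on the source, a distcp of the exported directory, and a Hive IMPORT on the DR cluster. The database, table, staging directory and nameservice names are hypothetical, the helper function is made up for illustration, and the target database is assumed to already exist on the DR side.

```python
# Sketch only: generates the three commands needed to bootstrap one existing
# Hive table onto the DR cluster. All names and paths are hypothetical.

def bootstrap_steps(database, table, staging_dir, source_ns, dr_ns):
    """Return (where_to_run, command) pairs for a one-off table copy."""
    path = f"{staging_dir}/{database}/{table}"
    return [
        # 1. On the source cluster: export table data plus metadata to HDFS.
        ("source hive", f"USE {database}; EXPORT TABLE {table} TO '{path}';"),
        # 2. Copy the exported directory across clusters.
        ("shell", f"hadoop distcp hdfs://{source_ns}{path} hdfs://{dr_ns}{path}"),
        # 3. On the DR cluster: recreate the table from the copied export.
        ("dr hive", f"USE {database}; IMPORT TABLE {table} FROM '{path}';"),
    ]


steps = bootstrap_steps(
    "sales", "orders", "/apps/hive/staging", "prodcluster", "drcluster"
)
for where, command in steps:
    print(f"[{where}] {command}")
```

You would loop this over every pre-existing table in the databases you plan to mirror, and only then start the Falcon mirroring job for ongoing changes.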
Created 06-23-2016 10:29 AM
@Predrag Monodic Thanks for your responses.