Created 05-26-2016 01:56 PM
HDFS Data
Teeing - Flume / Hortonworks Data Flow
Copying / Replication - distcp (invoking it manually), Falcon
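For reference, a minimal distcp invocation for the copy/replication approach might look like the sketch below (cluster hostnames and paths are placeholders):

# Hypothetical example: copy /data/sales from the primary to the DR cluster.
# -update copies only files that have changed, -p preserves permissions/ownership,
# -m caps the number of map tasks so the transfer does not saturate the network.
hadoop distcp -update -p -m 20 \
    hdfs://primary-nn:8020/data/sales \
    hdfs://dr-nn:8020/data/sales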
Hive Data
Behind the scenes, Hive data is stored in HDFS, so I presume the teeing / copying techniques described for HDFS above can be used here as well.
HBase Data
HBase native DR replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication)
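As a rough sketch, master-slave replication is set up from the primary cluster's hbase shell roughly as below, assuming hbase.replication is enabled in hbase-site.xml on both clusters; the DR ZooKeeper quorum, znode, table, and column-family names are placeholders:

# Register the DR cluster as replication peer '1' (ZooKeeper quorum:port:znode of the DR cluster).
add_peer '1', 'dr-zk1,dr-zk2,dr-zk3:2181:/hbase-unsecure'
# Enable replication on column family 'cf' of table 'mytable'.
disable 'mytable'
alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'mytable'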
Solr Indexes
If the indexes are stored in HDFS, the HDFS techniques above would cover Solr datasets as well.
Hive Metadata - Periodic backup of the metastore database from primary to DR cluster (a rough backup sketch for these databases and configs follows after this list)
Ambari - Ambari DB contains configurations for other ecosystem components. Periodic backup of the database from primary to DR cluster
Oozie - Oozie database contains job and workflow level information, so this needs to be backed up regularly to the DR cluster
Ranger - Ranger policy DB contains info about the various policies impacting RBAC. Needs to be backed up to the DR cluster
Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var)
Periodic backup of Configuration files for each application or service under /etc directory
Periodic backup of binaries (/usr/hadoop/current)
Periodic backup of any OS specific changes at a node level in the primary cluster
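A rough sketch of what those periodic backups could look like, assuming MySQL-backed Hive/Ambari/Oozie/Ranger databases and rsync access to a DR edge node; database names, credentials, hosts, and paths are placeholders:

# Hypothetical nightly backup script run from the primary cluster (e.g. via cron).
# Dump the metadata databases.
mysqldump -u backup -p"$BACKUP_PASS" hive   > /backups/hive_metastore.sql
mysqldump -u backup -p"$BACKUP_PASS" ambari > /backups/ambari.sql
mysqldump -u backup -p"$BACKUP_PASS" oozie  > /backups/oozie.sql
mysqldump -u backup -p"$BACKUP_PASS" ranger > /backups/ranger.sql
# Archive service and Ambari configuration directories.
tar czf /backups/etc_configs.tar.gz /etc/hadoop /etc/hive /etc/hbase /etc/ambari-server /etc/ambari-agent
# Ship everything to a landing directory on the DR cluster.
rsync -az /backups/ dr-edge-node:/backups/primary/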
Apologies for the lengthy post, but I wanted to cover all the areas around DR, hence posting it as a single question.
Thanks
Vijay
Created 05-26-2016 02:46 PM
Teeing vs Copying - Which one is preferred over the other? I understand it's scenario dependent, but which has better adaptability and is more widely used in the industry? Copying?
With Teeing, you can split up primary tasks between the 2 clusters and use the other cluster as DR for that task. As an example, if you have clusters C1 and C2, you can use C1 as the primary cluster and C2 as DR for some teams/tasks, and use C2 as the primary cluster and C1 as DR for other users/tasks.
Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are things to consider if same version is not possible?
It is convenient to have them both on the same version. This is especially the case if you want to use the DR cluster with almost no code changes when the primary cluster is down.
Should it be like for like topology between clusters in terms of component placement including gateway nodes and zookeeper services?
This is not required.
How does security play out for DR? Should both the cluster nodes be part of the same Kerberos realm or can they be part of different realms?
For DR, the same realm is a lot easier to manage than cross-realm, but cross-realm is possible.
Can the replication factor be lower? Or is it recommended to keep it the same as on the primary cluster?
I have seen replication factor 2 used on DR clusters, but if the DR cluster becomes the primary after a disaster, you may have to change the replication factor to 3 on all data sets.
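For example, if data was kept at replication factor 2 on the DR cluster and it gets promoted to primary, something like the following (path is a placeholder) raises it back to 3:

# Recursively set replication factor 3 on the promoted data sets;
# -w waits until re-replication actually completes.
hdfs dfs -setrep -w 3 /data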
Any specific network requirements in terms of latency, speed, etc. between the clusters?
For distcp, each node on one cluster should be able to communicate with each of the nodes on the second cluster.
Is there a need to run the balancer on the DR cluster periodically?
Yes. It is always good to run the balancer to keep a similar number of blocks across nodes.
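A typical invocation on the DR cluster (the threshold value is a placeholder; it is the allowed deviation of per-node disk utilization from the cluster average, in percent):

# Rebalance blocks so no DataNode deviates more than 10% from the average utilization.
hdfs balancer -threshold 10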
How does encryption play out between the primary and DR clusters? If encryption at rest is enabled in the primary one, how is it handled in the DR cluster? What are the implications of wire-encryption while transferring the data between the clusters?
Wire encryption will slow down transfers a little bit.
When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported onto another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but available in a snapshot, will it be synced to the snapshot directory on the DR cluster?
If you are using snapshots, you can simply run distcp against the snapshots instead of the actual data set.
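A sketch of snapshot-based incremental replication with distcp -diff (paths and snapshot names are placeholders; the target directory on the DR cluster must already hold an unmodified copy of the first snapshot):

# On the primary cluster: allow and take snapshots on the source directory.
hdfs dfsadmin -allowSnapshot /data/sales
hdfs dfs -createSnapshot /data/sales s1
# ...after the initial full copy of s1 to the DR cluster, take a new snapshot
# and copy only the delta between s1 and s2.
hdfs dfs -createSnapshot /data/sales s2
hadoop distcp -update -diff s1 s2 \
    hdfs://primary-nn:8020/data/sales \
    hdfs://dr-nn:8020/data/sales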
For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?
I don't think automating Ambari is a good idea. Configs don't change that much, so a simple process of duplicating them might be better. Backing up would mean you need to have the same hostnames and the same topology. For Hive, instead of a complete backup, Falcon can take care of table-level replication.
For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
Not sure where your application data resides, but for configuration, since everything is managed by Ambari, you just need to keep the Ambari configuration in sync.
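One way to capture the Ambari-managed configuration for comparison or re-application on the DR side is to export the cluster as a blueprint via the Ambari REST API (hostname, credentials, and cluster name below are placeholders):

# Export the primary cluster's configuration as an Ambari blueprint (JSON).
curl -u admin:admin -H "X-Requested-By: ambari" \
  "http://primary-ambari:8080/api/v1/clusters/MyCluster?format=blueprint" \
  -o primary_blueprint.json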
Created 05-26-2016 02:28 PM
Hey Vijay, yep, this might be too big of a set of questions for HCC. My suggestion is to search for particular topics to see if they are already being addressed and then ultimately, imagine these as separate discrete questions. For example, see https://community.hortonworks.com/questions/35539/snapshots-backup-and-dr.html as a pointed set of questions around snapshots; ok... that one had a bunch of Q's in one, too. 😉 Another alternative is to get hold of a solutions engineer from a company like (well, like Hortonworks!) to try to help you through all of these what-if questions. Additionally, a consultant can help you build an operational "run book" that addresses all of these concerns in a customized version for your org. Good luck!