
Questions on Disaster Recovery

  • I have seen a few articles and questions on the community around Disaster Recovery. However, it's still not completely clear to me, hence this new question:
  • As I understand it, these entities typically need to be backed up / synced between the clusters:
  • Primary Datasets

HDFS Data

Teeing - Flume / Hortonworks Data Flow

Copying / Replication - distcp (invoked manually), Falcon (see the sketch after this list)

Hive Data

Behind the scenes, Hive data is stored in HDFS, so I presume the same teeing / copying techniques described above for HDFS can be used here as well.

HBase Data

HBase native DR replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication); peer setup is also covered in the sketch after this list

Solr Indexes

If indexes are being stored in HDFS, HDFS techniques would cover Solr datasets as well
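
For concreteness, here is a minimal sketch of how I understand the manual distcp copy and the HBase replication peer setup above would be invoked; the hostnames, paths, table and peer names are placeholders only:

  # Manual distcp of an HDFS path from the primary to the DR cluster
  # (nn-primary / nn-dr are placeholder NameNode hosts; -pugp preserves user, group and permissions)
  hadoop distcp -update -pugp hdfs://nn-primary:8020/data/app1 hdfs://nn-dr:8020/data/app1

  # HBase master-slave replication: inside the hbase shell, register the DR
  # cluster's ZooKeeper quorum as a peer and enable replication on a column family
  hbase shell
  add_peer '1', CLUSTER_KEY => 'zk-dr1,zk-dr2,zk-dr3:2181:/hbase-unsecure'
  alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => 1}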

  • DB backed services
  • Hive Metadata - Periodic backup of the database from primary to DR cluster

    Ambari - the Ambari DB contains configurations for the other ecosystem components. Periodic backup of the database from the primary to the DR cluster

    Oozie - the Oozie database contains job and workflow level information, so this needs to be backed up regularly to the DR cluster

    Ranger - the Ranger policy DB contains information about the various policies impacting RBAC. This needs to be backed up to the DR cluster

  • Configurations

    Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var); a rough cron/rsync sketch of this kind of backup follows this list

    Periodic backup of configuration files for each application or service under the /etc directory

    Periodic backup of binaries (/usr/hadoop/current)

    Periodic backup of any OS specific changes at a node level in the primary cluster
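
    As an illustration only, I picture this kind of periodic configuration backup being driven by cron and rsync; the destination host and backup paths below are placeholders:

        # Nightly push of Ambari and service configuration directories to a backup host
        0 2 * * * rsync -az /etc/ambari-server /etc/ambari-agent dr-backup-host:/backups/$(hostname)/etc/
        0 2 * * * rsync -az /var/lib/ambari-server /var/lib/ambari-agent dr-backup-host:/backups/$(hostname)/var/
        0 3 * * * rsync -az /etc/hadoop /etc/hive /etc/hbase dr-backup-host:/backups/$(hostname)/etc/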

  • Application / User data
  • Queries on DR Strategy
    • Teeing vs Copying - which one is preferred over the other? I understand it is scenario dependent, but which has better adaptability and is more widely used in the industry? Copying?
    • Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are things to consider if same version is not possible?
    • Should there be a like-for-like topology between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
    • How does security play out for DR? Should both the cluster nodes be part of the same Kerberos realm or can they be part of different realms?
    • Can the replication factor be lower? Or is it recommended to keep it the same as on the primary cluster?
    • Any specific network requirements in terms of latency, speed etc. between the clusters?
    • Is there a need to run balancer on the DR cluster periodically?
    • How does encryption play out between the primary and DR clusters? If encryption at rest is enabled in the primary one, how is it handled in the DR cluster? What are the implications of wire-encryption while transferring the data between the clusters?
    • When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but is still available in a snapshot, will it be synced to the snapshot directory on the DR cluster?
    • For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?
    • For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
    • What extra / different functionality will third party solutions like WANDisco provide in comparison to Falcon? I am trying to understand the "active-active" working of WANDisco and why it is not possible with Falcon.
    • What is the recommendation to ensure gateway node services like Knox and client libraries are kept in sync between the clusters?
    • What is the recommendation for keeping application data in sync, for example Spark / Sqoop job-level information?

    Apologies for the lengthy post, but I want to cover all the areas around DR, hence posting it as a single question.

    Thanks

    Vijay

    1 ACCEPTED SOLUTION


    Teeing vs Copying - which one is preferred over the other? I understand it is scenario dependent, but which has better adaptability and is more widely used in the industry? Copying?

    With teeing, you can split the primary workload between the two clusters and use the other cluster as DR for that workload. As an example, if you have clusters C1 and C2, you can use C1 as the primary cluster and C2 as DR for some teams/tasks, and use C2 as the primary cluster and C1 as DR for some other users/tasks.

    Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are things to consider if same version is not possible?

    It is convenient to have them both on the same version. This is especially the case if you want to fail over to the DR cluster with almost no code changes when the primary cluster is down.

    Should it be like for like topology between clusters in terms of component placement including gateway nodes and zookeeper services?

    This is not required.

    How does security play out for DR? Should both the cluster nodes be part of the same Kerberos realm or can they be part of different realms?

    For DR, keeping both clusters in the same realm is a lot easier to manage than cross-realm trust, but cross-realm is possible.
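
    If cross-realm is required, the usual starting point is a pair of matching cross-realm trust principals; a rough sketch (realm names are placeholders, and the password you choose must be identical on both KDCs):

        # On the KDC of each realm, create the cross-realm trust principals
        kadmin.local -q "addprinc krbtgt/DR.EXAMPLE.COM@PRIMARY.EXAMPLE.COM"
        kadmin.local -q "addprinc krbtgt/PRIMARY.EXAMPLE.COM@DR.EXAMPLE.COM"
        # Then add the other realm to krb5.conf ([realms]/[capaths]) on the cluster
        # nodes and extend hadoop.security.auth_to_local so principals from the
        # other realm map to local users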

    Can the replication factor be lower? Or is it recommended to keep it the same as on the primary cluster?

    I have seen replication factor 2 used on DR clusters, but if the DR cluster becomes the primary after a disaster, you may have to change the replication factor to 3 on all data sets.
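
    For reference, that post-failover change is a single (potentially long-running) command; /data is a placeholder path:

        # Raise the replication factor to 3 on the promoted DR cluster
        # (-w waits for re-replication to finish; omit it on very large data sets)
        hdfs dfs -setrep -w 3 /data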

    Any specific network requirements in terms of latency, speed etc. between the clusters?

    For distcp, each node on one cluster should be able to communicate with each of the nodes on the second cluster.

    Is there a need to run balancer on the DR cluster periodically?

    Yes. It is always good to run the balancer to keep a similar number of blocks across nodes.
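
    For example (the threshold value is just an illustrative choice):

        # Rebalance until every DataNode is within 10% of the cluster's average utilization
        hdfs balancer -threshold 10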

    How does encryption play out between the primary and DR clusters? If encryption at rest is enabled in the primary one, how is it handled in the DR cluster? What are the implications of wire-encryption while transferring the data between the clusters?

    Wire encryption will slow down transfers a little bit.

    When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but is still available in a snapshot, will it be synced to the snapshot directory on the DR cluster?

    If you are using snapshots, you can simply run distcp against the snapshots instead of the live data set.
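
    A rough sketch of that pattern, assuming snapshot-based incremental copies (paths and snapshot names are placeholders, and the previous sync is assumed to have left snapshot s1 on both clusters):

        # One-time: allow snapshots on the source directory
        hdfs dfsadmin -allowSnapshot /data/app1

        # Each sync cycle: take a new snapshot and copy only the changes since the last one
        hdfs dfs -createSnapshot /data/app1 s2
        hadoop distcp -update -diff s1 s2 /data/app1 hdfs://nn-dr:8020/data/app1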

    For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?

    I don't think automating Ambari is a good idea. Configs don't change that much, so a simple process of duplicating them might be better. Backing up would mean you need the same hostnames and the same topology. For Hive, instead of a complete backup, Falcon can take care of table-level replication.

    For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?

    Not sure where your application data resides, but for configuration, since everything is managed by Ambari, you only need to keep the Ambari configuration in sync.
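
    One way to capture that Ambari-managed configuration for the DR side is to export the cluster as a blueprint through Ambari's REST API; a sketch (hostname, credentials and cluster name are placeholders):

        # Export the current cluster configuration as a blueprint JSON document
        curl -s -u admin:admin "http://ambari-host:8080/api/v1/clusters/MyCluster?format=blueprint" > mycluster_blueprint.json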


    REPLIES


    Hey Vijay, yep, this might be too big of a set of questions for HCC. My suggestion is to search for particular topics to see if they are already being addressed and then ultimately, imagine these as separate discrete questions. For example, see https://community.hortonworks.com/questions/35539/snapshots-backup-and-dr.html as a pointed set of questions around snapshots; ok... that one had a bunch of Q's in one, too. 😉 Another alternative is to get hold of a solutions engineer from a company like (well, like Hortonworks!) to try to help you through all of these what-if questions. Additionally, a consultant can help you build an operational "run book" that addresses all of these concerns in a customized version for your org. Good luck!
