Support Questions

Find answers, ask questions, and share your expertise

Full Disaster Recovery with Multiple On-Premise Data Centers

Master Guru

I am investigating a good disaster recovery solution for banking with multiple petabytes of data. This would be data in HDFS (Parquet, Avro), Kafka, Hive and HBase.

It's not just the data: we also need to keep BI tools in sync and have Spark jobs still function.

I have looked at WANDisco, but that's HBase and HDFS only. Is there something to keep applications and BI items in sync?

1 ACCEPTED SOLUTION

Master Guru

Also, Cloudera has tools for Hive and other data replication as part of CDP.

View solution in original post

5 REPLIES

Master Guru

Cool, thanks. We have tried Kafka mirroring and it has had a lot of issues. I am thinking NiFi can solve a lot of these problems. I think it's a matter of budget: how many NiFi nodes, and how many extra nodes to help process the data migrating over.

A few people were thinking dual ingest, but that is usually hard to keep in sync. With NiFi, that should not be a problem.

I wonder if someone has a DR example in NIFI worked up already?
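The dual-ingest idea mentioned above can be sketched in a few lines: every record is written to both clusters at ingest time, so the clusters never diverge by design. The sketch below is an assumption of how one might wire it up (the `dual_ingest` helper and all topic/broker names are illustrative, not an official API); the producer objects are anything exposing Kafka's usual `send()`/future interface, e.g. kafka-python's `KafkaProducer`.

```python
def dual_ingest(producers, topic, value):
    """Send the same record to every cluster and wait for all acks.

    `producers` is a list of producer-like objects whose send() returns
    a future with a get() method (as kafka-python's KafkaProducer does).
    Returns the per-cluster send metadata.
    """
    # Fire the send to every cluster first, then block on all acks;
    # if any cluster is down, get() raises and the caller must retry
    # or queue the record -- this is exactly the "hard to keep in
    # sync" part of dual ingest.
    futures = [p.send(topic, value) for p in producers]
    return [f.get(timeout=30) for f in futures]

# With real clusters this would be wired up roughly like:
#   from kafka import KafkaProducer
#   primary   = KafkaProducer(bootstrap_servers="dc1-broker:9092")
#   secondary = KafkaProducer(bootstrap_servers="dc2-broker:9092")
#   dual_ingest([primary, secondary], "transactions", b"payload")
```

The failure handling is the whole problem: if one cluster rejects a write, you need a retry queue or reconciliation job, which is why a replication tool is usually simpler than dual ingest.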


Hello @TimothySpann, 3 years after your post I'm in a similar situation :). Would you let me know how you solved it? In my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync.

Master Guru

Cloudera Streams Replication Manager (SRM) with MirrorMaker 2 solves this easily.


Apache NiFi could also do this in a dual-ingest fashion, but SRM is a no-brainer: faster, automatic, Active-Active replication with full monitoring.


https://blog.cloudera.com/announcing-the-general-availability-of-cloudera-streams-management/
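For context, SRM manages MirrorMaker 2 under the hood, and an Active-Active MM2 setup is mostly declarative. The fragment below is a rough sketch of a standalone `connect-mirror-maker` style configuration, not SRM's managed setup (in CDP these options are driven through Cloudera Manager); the cluster aliases and broker addresses are placeholders.

```properties
# Two clusters, replicating in both directions (Active-Active).
clusters = primary, secondary
primary.bootstrap.servers = dc1-broker1:9092
secondary.bootstrap.servers = dc2-broker1:9092

# Mirror everything from primary to secondary, and back.
primary->secondary.enabled = true
primary->secondary.topics = .*
secondary->primary.enabled = true
secondary->primary.topics = .*

# Replication factors for mirrored data and MM2's internal
# heartbeat / checkpoint / offset-sync topics.
replication.factor = 3
checkpoints.topic.replication.factor = 3
heartbeats.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3
```

The checkpoint and offset-sync topics are what let consumers resume near their committed position after failing over to the other cluster.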

Master Guru

Also, Cloudera has tools for Hive and other data replication as part of CDP.
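In CDP, Hive replication is automated by Replication Manager, which builds on Hive's own replication commands. As a rough sketch of the underlying primitive (paths and database names below are placeholders, and the exact syntax varies by Hive version):

```sql
-- Bootstrap: dump the source database; the command returns the dump
-- location and a last-replication id.
REPL DUMP banking_db;

-- On the DR cluster, load that dump (the path is whatever REPL DUMP
-- returned on the source side).
REPL LOAD banking_db FROM '/user/hive/repl/<dump-id>';

-- Later rounds dump only the events since the last id, keeping the
-- DR copy incrementally in sync.
REPL DUMP banking_db FROM <last-repl-id>;
```

Replication Manager schedules these rounds and copies the underlying HDFS data for you, which is why it's preferable to scripting this by hand.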


Great, thanks a lot for your answers @TimothySpann. SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary has to keep ingesting data from the sources, with no (or minimal) downtime and data loss. As far as I can see, even with SRM there's still the situation where, if the primary Kafka cluster breaks, the secondary has to ingest on its own and data must not be lost.
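One wrinkle worth noting for the failover side: MirrorMaker 2's default replication policy prefixes a mirrored topic with the source cluster's alias, so records mirrored from "primary" show up on the secondary cluster under "primary.payments" rather than "payments". A consumer that must not miss records after failover therefore subscribes to both the local topic and its mirrored counterpart. The helper below is a small illustrative sketch (the function name, cluster aliases and topic names are assumptions, not an SRM API):

```python
def failover_topics(topic, remote_aliases):
    """Return the topic list a DR consumer should subscribe to.

    MM2's DefaultReplicationPolicy names a mirrored topic
    "<source-cluster-alias>.<topic>", so we pair the local topic
    with each remote-prefixed variant.
    """
    return [topic] + [f"{alias}.{topic}" for alias in remote_aliases]

# e.g. with kafka-python on the secondary cluster:
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer(
#       *failover_topics("payments", ["primary"]),
#       bootstrap_servers="dc2-broker:9092",
#   )
```

SRM's checkpointing can then translate committed offsets between clusters, so the failed-over consumer resumes near where it left off instead of reprocessing from the beginning.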