Created 04-27-2016 05:15 AM
I am investigating a good disaster recovery solution for banking with multiple petabytes of data. This would be data in HDFS (Parquet, Avro), Kafka, Hive, and HBase.
Not just the data: I also need to keep BI tools in sync and have Spark jobs still function.
I have looked at WANDisco, but that covers only HBase and HDFS. Is there something to keep applications and BI items in sync?
Created 04-27-2016 06:18 AM
Cool, thanks. We have tried Kafka mirroring and it has had a lot of issues. I am thinking NiFi can solve a lot of these problems; it's mostly a matter of budget: how many NiFi nodes, plus extra nodes, are needed to help process the data migrating over.
A few people were suggesting dual ingest, but that is usually hard to keep in sync. With NiFi, that should not be a problem.
I wonder if someone has a DR example in NiFi worked up already?
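Not a NiFi flow, but as a rough illustration of the dual-ingest idea, here is a minimal sketch using plain Kafka producer clients (the broker addresses and topic name are placeholders). The hard part in practice is exactly what's noted above: keeping the two clusters in sync when one write succeeds and the other fails.

```python
# Hypothetical dual-ingest sketch: every record is written to both
# clusters so the DR site stays in sync at ingest time.
from kafka import KafkaProducer  # pip install kafka-python

# Broker addresses below are placeholders for the primary and DR clusters.
primary = KafkaProducer(bootstrap_servers="primary-broker:9092", acks="all")
secondary = KafkaProducer(bootstrap_servers="dr-broker:9092", acks="all")

def dual_ingest(topic: str, payload: bytes) -> None:
    """Send the same record to both clusters and block until both acknowledge."""
    futures = [primary.send(topic, payload), secondary.send(topic, payload)]
    for f in futures:
        f.get(timeout=30)  # raises if either cluster failed to acknowledge

dual_ingest("transactions", b'{"account": 1, "amount": 100}')
```

If one send fails and the other succeeds, the clusters diverge, which is why a flow engine with retry and backpressure (NiFi) or a dedicated replicator is usually preferred over hand-rolled dual writes.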
Created 10-04-2019 02:32 PM
Hello @TimothySpann , 3 years after your post I'm in a similar situation :). Would you let me know how you solved it? At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync.
Created 10-09-2019 08:44 AM
Cloudera Streams Replication Manager (SRM) with MirrorMaker 2 solves this easily.
Apache NiFi could also do this in a dual-ingest fashion, but SRM is a no-brainer: faster, automatic, active-active replication with full monitoring.
https://blog.cloudera.com/announcing-the-general-availability-of-cloudera-streams-management/
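For reference, a minimal MirrorMaker 2 driver config sketch for active-active replication between two clusters (the cluster aliases, broker addresses, and topic pattern below are placeholders; SRM manages an equivalent setup through Cloudera Manager):

```properties
# connect-mirror-maker.properties style sketch (names are placeholders)
clusters = primary, secondary
primary.bootstrap.servers = primary-broker:9092
secondary.bootstrap.servers = dr-broker:9092

# Replicate in both directions for active-active
primary->secondary.enabled = true
primary->secondary.topics = .*
secondary->primary.enabled = true
secondary->primary.topics = .*

replication.factor = 3
```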
Created 10-09-2019 08:45 AM
Cloudera also has tools for Hive and other data replication as part of CDP.
Created 10-09-2019 10:18 AM
Great, thanks a lot for your answers @TimothySpann . SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary Kafka cluster has to keep ingesting data from the sources, minimizing (or eliminating) downtime and data loss. As far as I can see, with SRM, even if the primary Kafka cluster breaks, the secondary cluster keeps ingesting and data does not have to be lost.
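A sketch of what a failover consumer on the secondary cluster might look like, assuming MirrorMaker 2's default replication policy (which prefixes replicated topics with the source cluster alias, e.g. primary.transactions); the broker address, topic, and group id are placeholders:

```python
# Hypothetical failover sketch: on the secondary cluster, consume both
# the locally ingested topic and the copy replicated from the primary.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "transactions",          # records ingested directly on the secondary
    "primary.transactions",  # records replicated from the primary by MM2/SRM
    bootstrap_servers="dr-broker:9092",
    group_id="dr-consumers",
    auto_offset_reset="earliest",
)

for record in consumer:
    # Both streams carry the same logical data; downstream logic should
    # deduplicate by key or business id if exactly-once matters.
    print(record.topic, record.partition, record.offset, record.value)
```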