I am going to install PROD and COB environments on cloud VMs, and I am thinking about the right DR strategy. I have the following options:
1. Using Cloudera BDR: I can't use it, as I don't have Cloudera Manager; I installed Cloudera Hadoop, Hive, and Impala from tar packages. So this is not an option.
2. Using racks to store the data: My PROD and COB are in different data centres, so I thought of creating a single Hadoop cluster with a replication factor of 4, so that two copies go to PROD and two to COB. I would have two NameNodes, one in the PROD data centre and the other in the COB data centre, so that if an entire data centre is lost, ZooKeeper automatically makes the NameNode in the COB data centre active.
I am not able to follow this approach, as my cloud team said there are no racks on cloud-based machines.
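To make the option 2 setup concrete, here is roughly what it would look like in hdfs-site.xml — a sketch only, with hypothetical hostnames. Note that without rack (or data-centre) topology awareness, a replication factor of 4 does not actually guarantee two copies per site, which is part of the problem:

```xml
<!-- Sketch of option 2: one HA cluster spanning both data centres.
     Hostnames nn-prod / nn-cob and the nameservice name are hypothetical. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>drcluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.drcluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.drcluster.nn1</name>
    <value>nn-prod.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.drcluster.nn2</name>
    <value>nn-cob.example.com:8020</value>
  </property>
  <property>
    <!-- ZKFC handles automatic failover via ZooKeeper -->
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```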
3. Create two independent Hadoop clusters, one in each of the PROD and COB data centres, and keep them in sync.
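A minimal sketch of what option 3 could look like with DistCp, run periodically (e.g. from cron). The NameNode addresses and paths are hypothetical:

```shell
# One-way sync from PROD to COB using DistCp. -update copies only changed
# files, -delete removes destination files that no longer exist at the
# source, and -p preserves permissions and timestamps.
hadoop distcp -update -delete -p \
  hdfs://nn-prod.example.com:8020/data/warehouse \
  hdfs://nn-cob.example.com:8020/data/warehouse
```

Note that this only copies HDFS data; Hive metadata would have to be kept in sync separately, for example by replicating the metastore database or re-running DDL on the COB cluster.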
Kindly suggest what I should do; I still like option 2. Any valuable suggestions will be much appreciated.
From what I've seen and learned, making a single Hadoop cluster that spans data centers is not a good approach. One reason is that the slower network links between data centers impair the performance of the cluster too much. Also, high availability functions for services like HDFS and YARN only allow two master-type daemons running (NameNodes for HDFS, ResourceManagers for YARN), so creating some arrangement with four of them would be challenging for sure.
Keeping data in cloud storage services that are naturally redundant and highly available is a better path. You can set up data replication to span availability zones or regions. Then, the COB (I assume that's "Continuity Of Business") cluster can be ready to use the data from its data center if something happens to production. Using cloud storage services is probably cheaper and easier than shipping data yourself from cluster to cluster.
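As a sketch of that pattern, assuming an S3-compatible object store with cross-region replication already enabled on the bucket (the bucket name is hypothetical, and the clusters need the s3a connector configured): PROD pushes its datasets to the bucket, and the COB cluster reads the replicated copy directly.

```shell
# From the PROD cluster: push warehouse data to object storage.
hadoop distcp -update \
  hdfs://nn-prod.example.com:8020/data/warehouse \
  s3a://dr-warehouse-bucket/warehouse

# From the COB cluster: use the same data in place, with no restore step.
hadoop fs -ls s3a://dr-warehouse-bucket/warehouse
```

Hive tables on the COB side could then be defined as external tables with their location pointing at the s3a paths, so the cluster itself holds no critical state.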
Doing this effectively implies that data resident on the cluster, such as in HDFS, is safe to lose if an outage occurs, and that it can be reconstructed on the other cluster from the same underlying data. The benefit of designing workloads like this, however, is that the clusters themselves become less critical, provided you can spin new ones up when necessary, using Director for example.
There are plenty of patterns that can be imagined here, so I'm interested in hearing what others have done too.