Created on 01-20-2014 04:00 AM - edited 09-16-2022 01:52 AM
Hi,
we are thinking of having two data centers, with the requirement that the cluster survive the failure of a whole data center. What would be the preferred setup?
a) ONE Hadoop cluster spanning both data centers, or
b) TWO independent Hadoop clusters with (somehow) synced data
Questions:
many thanks in advance...Gerd...
Created 01-20-2014 01:04 PM
Cloudera Enterprise offers a backup and disaster recovery (BDR) tool which handles HDFS replication and related mechanisms of the kind you are seeking. I also wrote this blog entry on the different mechanisms that are available for HBase backup and disaster recovery. You didn't specify whether you are using HBase, but it might help.
Some customers set up their user applications so that data is written to two clusters simultaneously. This is a cheap form of replication: all data is written to cluster A and cluster B up front. You will have to write this code yourself and also make it fault tolerant, etc.
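A minimal sketch of what such a dual write could look like, assuming two HDFS clusters reachable at hypothetical NameNode addresses nn-a and nn-b; the fault-tolerance logic (what to do when one write fails) is only hinted at in the comments:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DualWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URIs for the two data centers.
        FileSystem fsA = FileSystem.get(URI.create("hdfs://nn-a:8020"), conf);
        FileSystem fsB = FileSystem.get(URI.create("hdfs://nn-b:8020"), conf);

        byte[] payload = "some record".getBytes(StandardCharsets.UTF_8);
        Path target = new Path("/data/incoming/record-0001");

        // Write the same payload to both clusters. In a real application you
        // must decide what happens when one write fails (retry, queue for
        // later replay, etc.) -- that is the fault-tolerance code the answer
        // above says you have to write yourself.
        writeTo(fsA, target, payload);
        writeTo(fsB, target, payload);
    }

    private static void writeTo(FileSystem fs, Path path, byte[] payload) throws IOException {
        try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
            out.write(payload);
        }
    }
}
```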
To answer your other questions, I would definitely recommend you have two independent clusters. One cluster spanning a WAN will not work very well, if at all.
Created 01-21-2014 12:00 AM
Hi Clint,
many thanks for your very helpful answer and the brilliant blog post about HBase replication.
There's just one more question:
If Cloudera Enterprise is not an option ($$$) and the synchronisation needs to happen at the storage layer, is running DistCp repeatedly (e.g. from a scheduled job) an appropriate low-cost solution, or how would you tackle this problem?
br...: Gerd :....
Created 01-21-2014 09:30 AM
Yes, DistCp is usually what people use for that. It has rudimentary functionality for syncing data between clusters. However, in a very busy cluster where files are frequently added and deleted and/or other data is changing, replicating those changes between clusters will require custom logic on top of HDFS. Facebook developed their own replication layer, but it is proprietary to their engineering department.
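A minimal sketch of such a repeated sync, using a simple in-process scheduler that shells out to the hadoop CLI (the paths and interval are hypothetical; in practice a cron job or an Oozie coordinator would do the same thing):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicDistCp {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run an incremental sync every hour. -update copies only files that
        // differ; -delete removes files from the target that no longer exist
        // on the source.
        scheduler.scheduleAtFixedRate(PeriodicDistCp::runDistCp, 0, 1, TimeUnit.HOURS);
    }

    private static void runDistCp() {
        try {
            Process p = new ProcessBuilder(
                    "hadoop", "distcp", "-update", "-delete",
                    "hdfs://nn-a:8020/data",   // source cluster (hypothetical)
                    "hdfs://nn-b:8020/data")   // target cluster (hypothetical)
                    .inheritIO()
                    .start();
            int exitCode = p.waitFor();
            if (exitCode != 0) {
                System.err.println("distcp failed with exit code " + exitCode);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Note that this gives you cheap incremental syncs, not transactional replication: files that change while a copy is running can end up in inconsistent states on the target, which is where the custom logic mentioned above comes in.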
Created 01-23-2014 12:41 PM
Clint, thank you very much.
Created 06-16-2016 08:05 PM
Can we monitor the NameNode edit log and use it to trigger continuous file copies from one cluster to another?
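Assuming a Hadoop version that ships the HDFS inotify API (HDFS-6634, available from Hadoop 2.6), a minimal sketch of tailing the NameNode's edit-log event stream could look roughly like this; the actual copy step is left as a hypothetical stub, since it needs the same fault-tolerance considerations discussed above:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class EditLogWatcher {
    public static void main(String[] args) throws Exception {
        // Typically requires superuser privileges on the source cluster.
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://nn-a:8020"), new Configuration());
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = events.take(); // blocks until events arrive
            for (Event event : batch.getEvents()) {
                switch (event.getEventType()) {
                    case CLOSE:
                        // A file finished being written -- a reasonable point
                        // to schedule a copy to the other cluster.
                        String path = ((Event.CloseEvent) event).getPath();
                        System.out.println("candidate for replication: " + path);
                        // scheduleCopy(path);  // hypothetical copy step
                        break;
                    case UNLINK:
                        // A file or directory was deleted; mirror the delete.
                        System.out.println("deleted: " + ((Event.UnlinkEvent) event).getPath());
                        break;
                    default:
                        break;
                }
            }
        }
    }
}
```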