Support Questions


Best practice for data replication/sync between two data centers

Guru

Hi,

thinking of having two data centers and the requirement that the cluster survives the failure of a whole data center, what would be the preferred setup?

a) ONE Hadoop cluster spanning both data centers, or

b) TWO independent Hadoop clusters with (somehow) synced data

Questions:

  • For option a), it seems obvious that the interconnect between the data centers needs to be very good, at least 1 Gbit/s?
  • Is it possible to configure Hadoop to replicate blocks to a different data center in preference to a different rack, via the rack topology script?
  • If option b) is chosen, how can automatic, continuous data replication between the two clusters be established (are there tools for this)?
  • What are the main considerations and recommendations for the requirement mentioned above?

many thanks in advance...Gerd...
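For context on the rack-topology question: the topology script is just an executable that maps hostnames or IPs to network paths, and stock HDFS block placement only distinguishes local node / same rack / different rack, so a "data center" tier can only be encoded in the path by convention. A minimal sketch, with hypothetical hostnames and subnets:

```shell
# Minimal sketch of a rack topology script (hostnames/subnets are
# hypothetical). Hadoop invokes the script configured under
# net.topology.script.file.name with one or more hosts as arguments and
# reads one network path per line from stdout. Written as a function here
# so the mapping can be exercised directly; deploy it as a standalone
# executable script.
resolve_rack() {
  for host in "$@"; do
    case "$host" in
      node1.dc1.example.com|10.1.*) echo "/dc1/rack1" ;;
      node2.dc1.example.com|10.2.*) echo "/dc1/rack2" ;;
      node1.dc2.example.com|10.3.*) echo "/dc2/rack1" ;;
      *)                            echo "/default-rack" ;;
    esac
  done
}

resolve_rack node1.dc1.example.com 10.3.7.21
# emits /dc1/rack1 and /dc2/rack1, one per line
```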

1 ACCEPTED SOLUTION

Guru

Yes, DistCp is usually what people use for that.  It has rudimentary functionality for syncing data between clusters.  However, in a very busy cluster where files are frequently added and deleted and/or other data is changing, replicating those changes between clusters will require custom logic on top of HDFS.  Facebook developed its own replication layer, but it is proprietary to their engineering department.
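For concreteness, a recurring DistCp pass between two clusters might look like the sketch below. The NameNode URIs and paths are hypothetical, and the helper only assembles the command line so the invocation can be reviewed before it is actually run:

```shell
# Sketch of a recurring DistCp sync from cluster A to cluster B; the
# NameNode URIs and paths are hypothetical. build_distcp_cmd only
# assembles the command line, so the invocation can be inspected first.
SRC="hdfs://nn-a.example.com:8020/data"
DST="hdfs://nn-b.example.com:8020/data"

build_distcp_cmd() {
  # -update : copy only files that differ at the target
  # -delete : remove target files no longer present at the source (use with care)
  # -m 20   : cap the number of parallel copy tasks
  echo "hadoop distcp -update -delete -m 20 $1 $2"
}

CMD=$(build_distcp_cmd "$SRC" "$DST")
echo "$CMD"   # review, then run it (e.g. from cron): eval "$CMD"
```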


5 REPLIES

Guru

Cloudera Enterprise offers a backup and disaster recovery (BDR) tool that handles HDFS replication, along with other mechanisms like the ones you are seeking.  I also wrote this blog entry on the different mechanisms available for HBase backup and disaster recovery.  You didn't specify whether you are using HBase, but that might help.

Some customers set up their applications so that data is written simultaneously to two clusters.  This is a cheap form of replication: all data is written to cluster A and cluster B up front.  You will have to write this code yourself and make it fault tolerant, handle partial failures, etc.

To answer your other questions: I would definitely recommend two independent clusters.  One cluster spanning a WAN will not work very well, if at all.
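The dual-write pattern mentioned above might be sketched at the shell level like this (cluster URIs are hypothetical, and put_file is a stand-in for the real upload; production code would also need retries and a reconciliation path for the case where only one write succeeds):

```shell
# Sketch of application-level dual writes: each file is pushed to both
# clusters and a failure on either side is surfaced via the return code.
# Cluster URIs are hypothetical; put_file wraps the actual upload so the
# control flow can be exercised without a live cluster.
CLUSTER_A="hdfs://nn-a.example.com:8020"
CLUSTER_B="hdfs://nn-b.example.com:8020"

put_file() {  # $1 = cluster URI, $2 = local file
  hdfs dfs -put -f "$2" "$1/data/$(basename "$2")"
}

dual_write() {  # $1 = local file; returns 1/2 if cluster A/B failed
  put_file "$CLUSTER_A" "$1" || return 1
  put_file "$CLUSTER_B" "$1" || return 2
}
```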

Guru

Hi Clint,

many thanks for your very helpful answer and the brilliant blog post about HBase replication.

There's just one more question:

If Cloudera Enterprise is not an option ($$$) and the synchronisation needs to be done on the storage layer, is repeatedly running DistCp an appropriate low-cost solution, or how would you tackle this problem?

br...: Gerd :....


Guru

Clint, thank you very much.

New Contributor

Can we monitor the NameNode edit logs and use them to trigger continuous file copies from one cluster to another?
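Tailing the NameNode edit log directly is fragile (it is an internal format), though HDFS does expose an inotify API intended for this kind of change tracking. A more common route, assuming the source directory is snapshottable, is DistCp's snapshot-diff mode; a sketch with hypothetical URIs and snapshot names:

```shell
# Sketch: incremental sync driven by HDFS snapshot diffs instead of edit
# logs. URIs and snapshot names are hypothetical, and the source directory
# must be snapshottable (hdfs dfsadmin -allowSnapshot /data), so treat
# this as a shape rather than a recipe.
SRC="hdfs://nn-a.example.com:8020/data"
DST="hdfs://nn-b.example.com:8020/data"

sync_step() {  # $1 = previous snapshot name, $2 = new snapshot name
  hdfs dfs -createSnapshot "$SRC" "$2" || return 1
  # -diff replays only the changes between the two snapshots at the target
  hadoop distcp -update -diff "$1" "$2" "$SRC" "$DST"
}

# e.g. hourly from cron, advancing the snapshot pair each run:
# sync_step s_2024010100 s_2024010101
```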