01-18-2014 01:49 PM
Since I am facing the challenge to create a Disaster Recovery solution for an HBase system serving millions of inserts and reads per hour, i took a deep dive in the concepts which are already there starting with the excellent overview in this Cloudera blog post http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/.
But of course while investigating things, more and more questions arose. So i will try to ask here if someone can give me a hint.
I think i would tend to use active/active replication, since it matches the problem i have to solve. My questions would be:
1.) How long can the peer cluster be absent before I start loosing data? I mean we are talking about "disaster" which could result in several weeks outage since you might be forced to get complete new hardware. On the other hand I guess the remaining cluster cannot hold the WALedits selected and queued for replication forever (or can he?)
2.) How do i practically restore the failed cluster after that, without stopping both systems? Suggested solution is to use copyTable, but this is useless in a system which has 10000 or more writes per second. Since we organize data in daily tables i could do that for all "day-1" tables. But what about the actual day table which is under heavy write load? Can i even use export and import table while replication is active? Somehow i feel there is a gap in the actual concepts. They are all focused on how to replicate the data (which is cool) but then how do i get it back?
3.) Is there a way to copy newest edits first in case of restoring the crashed cluster if it comes back online? So to say a "reverse" order restore, so that it can already serve requests for most recent data (which is highly expected to be the 80% usecase)
Would be nice to hear if someone has encountered similar problems.
01-21-2014 10:00 AM - edited 01-21-2014 10:17 AM
These are difficult challenges, to be sure, but I will attempt to address your concerns here:
1) The theoretic limit to how large your replication queue can grow is the size of your HDFS storage, since the items needing replication are stored in HBase WALs on HDFS. I'm not aware of any real world tests that have been done to prove that limit, though.
In your case it sounds like your replication queue would definitely grow very large very rapidly, so this is something to take into consideration. For an extended outage like you describe (eg. weeks) you might actually be better off just disabling replication on the surviving cluster and start over from scratch once you have restored the downed cluster. In other words, restore a fresh copy of your data to the new cluster and then enable replication again. Snapshots would be a good way to reasonably get a current image copied to the restored cluster. More on that below...
2) The backup and disaster recovery options I described in the blog do take "restore" into account and provide functionality for that, but they largely leave the implementation up to you. In other words, you will have to come up with a gameplan that works for your environment/infrastructure and test it. As I alluded above, I might recommend that you disable replication on the surviving cluster for an extended outage on the order of weeks. Once a cluster is available for replication again, you could export a snapshot of your existing cluster and use bulk loads to load the data into the new cluster. This can be done on a per-table basis, so you can choose your most critical tables first and enable replication on them at the same time in order to get your data mostly in sync. Note that there might still be some data that gets out of sync since your cluster is so active, this would require a bit of manual intervention to add/update the rows that are not in sync. However, once they are sync'd up, replication will take over again and you'll be fine. As you indicated, the CopyTable and Export functions would not be preferable as they would put a heavy MapReduce and HBase API load on your source cluster. Snapshots do not introduce such a load during their creation and although running "exportSnapshot" (or even using DistCP to copy the resulting snapshot to the restored cluster) will create a Mapreduce job, that job will not burden HBase as much as it will not make API calls to the tables at all.
3) As previously stated, I think your best bet here is to restore the most active tables first. Since you don't want to make the data in the restored cluster available for client requests until it's all sync'd up with the surviving cluster anyway (for data accuracy sake), it's really a moot point to attempt to restore data in a reverse order. Nonetheless, no such functionality exists that I'm aware of.
For an interesting read, see how Facebook did it.
01-22-2014 12:36 PM
Thanks for your reply.
The facebook link was interesting to read.Unfortunately we have it a bit more complicated, since we are developing a product which gets installed in customers datacenters and has to work with minimal manual interaction without losing any data. (You want your mobile phone bills to be correct ;) )
If we go down that road, I would indeed follow your advise to shut the replication down on the surviving cluster and use snapshots to restore the failed cluster when it comes back online.