Member since: 01-18-2014
Posts: 12
Kudos Received: 2
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1827 | 11-02-2014 11:47 AM
 | 21721 | 07-29-2014 04:50 AM
11-02-2014 11:47 AM (1 Kudo)
If you start the cluster manually (i.e. you start every service yourself), you don't even need the slaves file or SSH; those are only required by the start scripts. On startup, the DataNode contacts its configured NameNode and "offers" its service. So if you started the DataNode manually, check its log file to see why it cannot reach the NameNode. BR
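For illustration, a minimal sketch of the relevant configuration: the DataNode finds its NameNode through fs.defaultFS in core-site.xml (the hostname, port, and log path mentioned here are placeholders, not taken from this thread):

```
<!-- core-site.xml on the DataNode host; hostname and port are placeholders -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

If the DataNode still does not register, its log (e.g. under /var/log/hadoop-hdfs/, the exact path depends on your distribution) usually shows repeated connection retries to that address.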
07-29-2014 04:50 AM
It looks like your ZooKeeper quorum was not able to elect a master. Maybe your ZooKeeper is misconfigured? Make sure you have entered all three servers in your zoo.cfg, each with a unique ID, make sure the same configuration is present on all three machines, and make sure every server uses the correct myid as specified in the cfg. BR Marc
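As a minimal sketch of such a setup (hostnames and the data directory are placeholders, not taken from this thread): the quorum section of zoo.cfg is identical on all three servers, and each server's dataDir contains a myid file holding only its own number:

```
# zoo.cfg, identical on all three machines (hostnames are placeholders)
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

On the machine listed as server.1 you would then have, for example, `echo 1 > /var/lib/zookeeper/myid` (adjust the path to your dataDir), and 2 and 3 on the other two machines.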
01-22-2014 12:36 PM
Thanks for your reply. The Facebook link was interesting to read. Unfortunately our situation is a bit more complicated, since we are developing a product that gets installed in customers' datacenters and has to work with minimal manual interaction without losing any data. (You want your mobile phone bills to be correct 😉) If we go down that road, I would indeed follow your advice to shut down replication on the surviving cluster and use snapshots to restore the failed cluster when it comes back online. Regards Marc
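A rough sketch of that snapshot-based restore, using the standard HBase shell commands and the ExportSnapshot tool (table name, snapshot name, and the destination HDFS URL are made up for illustration):

```
# On the surviving cluster, from the HBase shell: freeze a point-in-time view
snapshot 'billing_events', 'billing_events_snap1'

# From the command line: copy the snapshot files to the recovered cluster
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot billing_events_snap1 \
  -copy-to hdfs://recovered-namenode:8020/hbase

# On the recovered cluster, from the HBase shell: recreate the table
clone_snapshot 'billing_events_snap1', 'billing_events'
```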
01-18-2014 01:49 PM
Hi there, since I am facing the challenge of building a disaster recovery solution for an HBase system serving millions of inserts and reads per hour, I took a deep dive into the concepts that already exist, starting with the excellent overview in this Cloudera blog post: http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/. Of course, while investigating, more and more questions arose, so I will ask here in case someone can give me a hint. I would tend to use active/active replication, since it matches the problem I have to solve. My questions are:

1.) How long can the peer cluster be absent before I start losing data? We are talking about a "disaster", which could mean an outage of several weeks if you are forced to procure completely new hardware. On the other hand, I assume the remaining cluster cannot hold the WALEdits selected and queued for replication forever (or can it?).

2.) How do I practically restore the failed cluster afterwards, without stopping both systems? The suggested solution is to use CopyTable, but this is useless in a system with 10000 or more writes per second. Since we organize data in daily tables, I could do that for all "day-1" tables, but what about the current day's table, which is under heavy write load? Can I even use Export and Import on a table while replication is active? Somehow I feel there is a gap in the current concepts: they are all focused on how to replicate the data (which is cool), but how do I get it back?

3.) Is there a way to copy the newest edits first when restoring the crashed cluster once it comes back online? A "reverse order" restore, so to speak, so that it can already serve requests for the most recent data (which is expected to be the 80% use case).

It would be nice to hear whether someone has encountered similar problems. Regards Marc
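For reference, a hedged sketch of the CopyTable invocation discussed in question 2.), bounded to a time window so it only touches one day's worth of data (the table name, timestamps, and the peer's ZooKeeper quorum below are placeholders, not taken from this post):

```
# Copy one daily table to the peer cluster, limited to a timestamp range
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1390348800000 \
  --endtime=1390435200000 \
  --peer.adr=zk1,zk2,zk3:2181:/hbase \
  events_20140122
```

Whether this is practical under sustained heavy write load is exactly the open question raised above.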
Labels:
- Apache HBase