Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hbase Master Failover

avatar
Contributor

We are running a Hbase Cluster with HBase slave and master configuration. We need a way to make sure when Hbase master active fails, the request from application should automatically be redirected to Hbase master in failover more. We run our cluster in ec2. We need some typical strategies used for this purpose. Any ideas ?

3 REPLIES 3

avatar
Super Guru

Apache HBase supports the ability to run multiple stand-by Master processes. These use Apache ZooKeeper to perform leader election. When/if the active Master fails, a stand-by will take its place.

You should use Apache Ambari to coordinate the HBase Master running on multiple nodes.

avatar
Contributor

Thanks Josh for explaining! A follow question would be what does Master "fail" mean? Sometimes we notice our app talking HBase does not respond. And, then we notice that the Hbase master it was talking to had a resource alert on Ambari UI. And, we do not know if the failover was even initiated(only when one master crashes? What about if it is just slow?). When does failover get initiated? Can you point to some article which talks about it? I can't seem to find one easily online.

We also tried to do this:

1. talk to the active master ; Application writehbase-hbase-master-regionserver02stormca-internall.zips to the active master some data. Confirmed that the data was there in the active master by scanning the table on hbase shell.

2. Then failed over (to simulate failover) artificially. Then we did the same as above. List shows the table. But, scan on the table fails. (no response).

Attached the log with relevant errors.

avatar
Super Guru

Please read http://hbase.apache.org/book.html#architecture.master and http://hbase.apache.org/book.html#architecture.master. It sounds like your interpretation on what the HBase master does is very wrong. The HBase master is *not* involved with data access requests (scans in the shell).

The verb "fail" means that when the process either stops or crashes for some reason. Sometimes processes can pause, yet recover. Part of this relates to the timeout in holding the ZooKeeper lock for distributed mutual exclusion between multiple Master processes. By default, that timeout is 30-60seconds at worst.