I am seeing the following problem, and hope somebody might be able to provide some guidance.
I have a rudimentary Hadoop installation: three virtual hosts; one HDFS NameNode and three HDFS DataNodes (one per host); one YARN NodeManager and one YARN ResourceManager per host; one HBase Master and three HBase RegionServers (one per host); and three ZooKeeper servers (one per host).
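For reference, the ZooKeeper ensemble is wired up in the usual way, roughly like this (hostnames here are placeholders, not my actual values):

```
# zoo.cfg (excerpt) -- identical on all three hosts
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888
```

Each host also has the matching `myid` file (1, 2, or 3) under `dataDir`.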
All these components are connected and operating seamlessly as a cluster. For example, the ZooKeeper servers have achieved quorum and elected a leader; HBase sees a single "global state", and tables have been created and data ingested, etc. Everything runs fine.
However, when I reboot one of my worker hosts (to simulate a host crash and test the resiliency of my cluster), although ZooKeeper instantly recovers and elects a new leader from the remaining two ZooKeeper servers, HBase seems to lose track of where things are. For example, from the HBase Master host I can still list tables (using hbase shell), but cannot scan or get from them; from the other remaining HBase worker host (again using hbase shell), I can't even list any tables (it appears as if zero tables have been defined).
If I (re-)start the Hadoop components on the rebooted worker host, the newly started ZooKeeper server rejoins the other two, as I'd expect. But the global state of HBase remains the same -- strangely disconnected, as described above. The only fix I've found is to shut the entire cluster down and bring it back up; then everything is perfect once again.
I suspect that the momentary disruption of ZooKeeper (even though ZooKeeper instantly heals itself afterwards) is somehow affecting HBase in a manner that HBase can't recover from, but I haven't been able to figure out why, or what to do to prevent it. In case anybody is wondering: I am not running any HQuorumPeer processes, nor a backup HBase Master, at this time.
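For completeness, HBase is pointed at the external ZooKeeper ensemble rather than a managed one, roughly as follows (hostnames again are placeholders):

```
<!-- hbase-site.xml (excerpt) -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>host1,host2,host3</value>
</property>
```

with `export HBASE_MANAGES_ZK=false` in hbase-env.sh, which is why no HQuorumPeer processes appear in the process lists.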
Thanks in advance for any possible suggestions!