
HBase Master and Region Server shutting down with "ZooKeeper delete failed after 4 attempts"

We installed an HDP 2.3.4 cluster with Ambari 2.2.

The HBase Master and Region Servers start, but after some time the HBase Master shuts down.

The log file says:

2016-01-25 14:46:47,340 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
2016-01-25 14:46:47,340 ERROR [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2016-01-25 14:46:47,340 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.ZKUtil: master:16000-0x3527a1898200012, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/master
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:745)
    at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:148)
    at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:267)
    at org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:1164)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-01-25 14:46:47,340 ERROR [master/node03.test.com/x.x.x.x:16000] zookeeper.ZooKeeperWatcher: master:16000-0x3527a1898200012, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:745)
    at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:148)
    at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:267)
    at org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:1164)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-01-25 14:46:47,340 ERROR [master/node03.test.com/x.x.x.x:16000] master.ActiveMasterManager: master:16000-0x3527a1898200012, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, baseZNode=/hbase-unsecure Error deleting our own master address node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:745)
    at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:148)
    at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:267)
    at org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:1164)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-01-25 14:46:47,341 INFO  [master/node03.test.com/x.x.x.x:16000] hbase.ChoreService: Chore service for: node03.test.com,16000,1453750627948_splitLogManager_ had [] on shutdown
2016-01-25 14:46:47,341 INFO  [master/node03.test.com/x.x.x.x:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2016-01-25 14:46:47,342 INFO  [master/node03.test.com/x.x.x.x:16000] ipc.RpcServer: Stopping server on 16000
2016-01-25 14:46:47,342 INFO  [RpcServer.listener,port=16000] ipc.RpcServer: RpcServer.listener,port=16000: stopping
2016-01-25 14:46:47,343 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2016-01-25 14:46:47,343 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2016-01-25 14:46:47,345 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
2016-01-25 14:46:48,345 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
2016-01-25 14:46:50,345 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
2016-01-25 14:46:54,346 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
2016-01-25 14:47:02,346 WARN  [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node03.test.com:2181,node02.test.com:2181,node01.test.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
2016-01-25 14:47:02,346 ERROR [master/node03.test.com/x.x.x.x:16000] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-01-25 14:47:02,347 WARN  [master/node03.test.com/x.x.x.x:16000] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1345)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1334)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1403)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
    at java.lang.Thread.run(Thread.java:745)
2016-01-25 14:47:02,350 INFO  [master/node03.test.com/x.x.x.x:16000] regionserver.HRegionServer: stopping server node03.test.com,16000,1453750627948; zookeeper connection closed.
2016-01-25 14:47:02,351 INFO  [master/node03.test.com/x.x.x.x:16000] regionserver.HRegionServer: master/node03.test.com/x.x.x.x:16000 exiting

What steps do I take to solve this?

1 ACCEPTED SOLUTION

Super Guru

ZooKeeper session expirations (e.g. "org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/node03.test.com,16000,1453750627948") are a common issue that HBase can run into, and there are a number of reasons that can cause them.

Background: ZooKeeper clients need to maintain a "heartbeat" (a regular RPC to a ZooKeeper server) to keep their session alive (credit: Apache ZooKeeper, http://zookeeper.apache.org/doc/r3.4.6/images/state_dia.jpg):

[Image: ZooKeeper client session state diagram]

Without an active session, clients cannot interact with ZooKeeper. In the HBase context, this means HBase will keep retrying the connection so that it can read and write whatever state it needs in ZooKeeper (the retries and back-off are visible in the log above).
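
If you want to confirm the expiration from both sides, a quick check along these lines usually helps (a rough sketch; the log paths below are typical HDP defaults and the exact file names may differ on your cluster):

    # HBase side: the Master log records the expiration together with the session id
    # (log file name is an assumption -- adjust for your host and user)
    grep "Session expired" /var/log/hbase/hbase-hbase-master-node03.test.com.log

    # ZooKeeper side: the server that owned the session logs why it expired it,
    # e.g. "Expiring session 0x3527a1898200012, timeout of ...ms exceeded"
    grep -i "0x3527a1898200012" /var/log/zookeeper/zookeeper.log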

Some common reasons that this heartbeat fails (checks for each are sketched after the list):

  • JVM garbage collections. Pauses in the HBase JVM can prevent the heartbeat from occurring regularly. Look at the JVM GC logs to identify this; HBase will also log messages in the service logs when it notices long pauses.
  • Host-level pauses. The operating system moving parts of the JVM's memory out to swap (disk) can pause the entire JVM without any garbage collection. HBase will often notice this too, but it reports a pause without any garbage collection having happened.
  • Excess clients to ZooKeeper -- exceeding `maxClientCnxns` in zoo.cfg. See the ZooKeeper docs for a full description of this property; in short, it rate-limits connections from a single host. It is meant to prevent a denial-of-service attack, but excessive load from a combination of HBase, MapReduce, and other services can trigger this protection. In that case, clients are regularly trying to heartbeat to ZooKeeper, but the ZooKeeper server is actively *denying* the connection because it would exceed `maxClientCnxns` connections from the same host. You can use `netstat` to verify this is happening and/or check the ZooKeeper server log.
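
Some concrete checks for the causes above (a rough sketch; the paths are typical HDP 2.x defaults and may differ on your cluster):

    # 1. GC / JVM pauses: HBase's pause monitor logs long pauses in the service log
    # (log file name is an assumption -- adjust for your host and user)
    grep "Detected pause in JVM or host machine" /var/log/hbase/hbase-hbase-master-node03.test.com.log
    # To capture GC details, add GC logging to HBASE_MASTER_OPTS in hbase-env.sh, e.g.:
    #   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-master.log

    # 2. Host-level pauses: check whether the node is swapping
    free -m                      # swap "used" > 0 is a warning sign
    cat /proc/sys/vm/swappiness  # HBase tuning guides generally recommend a low value

    # 3. Excess connections: count established connections to ZooKeeper from this host
    netstat -an | grep ':2181' | grep ESTABLISHED | wc -l
    # Compare against maxClientCnxns in zoo.cfg and look for "Too many connections"
    # warnings in the ZooKeeper server log
    grep maxClientCnxns /etc/zookeeper/conf/zoo.cfg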

While removing the contents of HBase's root znode can be a temporary fix (especially on older versions, which rely heavily on ZooKeeper for region assignment), this is often a symptom of a much larger problem that will keep recurring.
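
For reference, if you do resort to clearing the znode as a stop-gap, the usual sequence looks roughly like this (the base znode is /hbase-unsecure here, matching baseZNode in the log above; stop HBase first, and expect any in-flight state kept in ZooKeeper to be lost):

    # With HBase stopped (e.g. via Ambari), open the ZooKeeper shell bundled with HBase
    hbase zkcli

    # Inside the ZooKeeper CLI: recursively delete HBase's root znode, then exit
    # (znode name taken from baseZNode in the log above)
    rmr /hbase-unsecure
    quit

    # Start HBase again; it will recreate its znodes on startup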

4 REPLIES

Expert Contributor

On my cluster, NiFi and one of the three HBase Region Servers run on the same host. I modified NiFi's bootstrap.conf file and uncommented java.arg.13=-XX:+UseG1GC, and the Region Server stopped. I tried many times to restart it; it would start and then soon stop again, until I commented out the java.arg.13=-XX:+UseG1GC property. It works now. I think that property changes the JVM's garbage collection algorithm.
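
For anyone else hitting this, the relevant line in NiFi's conf/bootstrap.conf looks like the following when left commented out (the arg number may differ between NiFi versions):

    # conf/bootstrap.conf -- leaving the flag commented out keeps NiFi on the JVM's default collector
    #java.arg.13=-XX:+UseG1GC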

Expert Contributor

Setting the timeouts in the HBase configuration did not work for me; ZooKeeper's tickTime was being used to bound the session timeout. Here's more info: https://superuser.blog/hbase-dead-regionserver/
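
To see which values are actually in play, you can compare what HBase asks for with what ZooKeeper will grant (a rough sketch; the paths assume a standard HDP layout). By default ZooKeeper caps the negotiated session timeout at 20 x tickTime, so raising only the HBase-side value has no effect:

    # What HBase requests (client side); defaults to 90000 ms if unset
    grep -A1 "zookeeper.session.timeout" /etc/hbase/conf/hbase-site.xml

    # What ZooKeeper will grant (server side); with tickTime=2000 the ceiling is
    # 20 * tickTime = 40000 ms unless maxSessionTimeout is raised in zoo.cfg
    grep -E "tickTime|SessionTimeout" /etc/zookeeper/conf/zoo.cfg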

New Contributor

I hit this issue on my Cloudera cluster. I restarted the ZooKeeper and HBase services, and it is working now.