Support Questions


I am running a 12-node cluster with an HBase Master and 12 RegionServers. Can anyone please give me the exact reason, an RCA, or a suggestion for why the HBase RegionServer went down? Also, is the ZooKeeper client mandatory on all of the HBase RegionServers or not?

Rising Star

Log file:

2017-06-12 23:06:29,603 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 7 attempts

2017-06-12 23:06:29,603 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.ZKUtil: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Unable to list children of znode /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)

at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)

at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)

at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)

at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:06:29,604 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)

at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)

at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)

at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)

at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:06:29,605 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] ipc.RpcServer: Stopping server on 16020

2017-06-12 23:06:29,606 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] token.AuthenticationTokenSecretManager: Stopping leader election, because: SecretManager stopping

2017-06-12 23:06:29,607 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping

2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped

2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping

2017-06-12 23:06:29,619 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:07:29,625 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...

2017-06-12 23:07:29,625 INFO [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.

2017-06-12 23:07:29,628 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.

2017-06-12 23:07:29,628 INFO [pool-684-thread-1] timeline.HadoopTimelineMetricsSink: Closing HadoopTimelineMetricSink. Flushing metrics to collector...

2017-06-12 23:07:30,132 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Initializing Timeline metrics sink.

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Identified hostname = sddsvrwm383.scglobaluat.aduat.com, serviceName = hbase

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Collector Uri: http://sddsvrwm368.scglobaluat.aduat.com:6188/ws/v1/timeline/metrics

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started

2017-06-12 23:07:32,620 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:08:36,621 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 7 attempts

2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: Failed deleting my ephemeral node

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1222)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1211)

at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1427)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1098)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: stopping server sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052; zookeeper connection closed.

2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020 exiting

2017-06-12 23:08:36,627 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting

java.lang.RuntimeException: HRegionServer Aborted

at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)

at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)

at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2681)

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: ==> JVMShutdownHook.run()

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Signalling async audit cleanup to start.

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.

2017-06-12 23:08:36,634 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Starting cleanup

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Stop called. name=hbaseRegional.async

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Interrupting consumerThread. name=hbaseRegional.async, consumer=hbaseRegional.async.summary

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Done cleanup

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Waiting to audit cleanup start signal

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Audit cleanup finished after 2 milli seconds

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Interrupting ranger async audit cleanup thread

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: <== JVMShutdownHook.run()

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Interrupted while waiting for audit startCleanup signal! Exiting the thread...

java.lang.InterruptedException

at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)

at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)

at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)

at org.apache.ranger.audit.provider.AuditProviderFactory$RangerAsyncAuditCleanup.run(AuditProviderFactory.java:487)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting polling loop. name=hbaseRegional.async

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Calling to stop consumer. name=hbaseRegional.async, consumer.name=hbaseRegional.async.summary

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Stop called. name=hbaseRegional.async.summary

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Interrupting consumerThread. name=hbaseRegional.async.summary, consumer=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting consumerThread.run() method. name=hbaseRegional.async

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting polling loop. name=hbaseRegional.async.summary

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Calling to stop consumer. name=hbaseRegional.async.summary, consumer.name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Stop called. name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,637 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@4f3faa70

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Interrupting consumerThread. name=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary

2017-06-12 23:08:36,638 INFO [pool-5-thread-1] regionserver.ShutdownHook: Starting fs shutdown hook thread.

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread. Queue=hbaseRegional.async.summary.batch, dest=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Calling to stop consumer. name=hbaseRegional.async.summary.batch, consumer.name=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] provider.BaseAuditHandler: Audit Status Log: name=hbaseRegional.async.summary.batch.hdfs, interval=04:18.058 minutes, events=2, succcessCount=2, totalEvents=12, totalSuccessCount=12

2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditFileSpool: Stop called, queueName=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,652 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,651 INFO [hbaseRegional.async.summary.batch_hbaseRegional.async.summary.batch.hdfs_destWriter] queue.AuditFileSpool: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,655 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook finished.

3 Replies

Super Guru

ZooKeeper clients maintain a state machine of connectedness to the quorum. This state machine is handled in the background as HBase (acting as a ZooKeeper client) interacts with ZooKeeper.

Under normal conditions, ZooKeeper clients (HBase services such as the RegionServer and Master) maintain a "heartbeat" (an RPC) to ZooKeeper every few seconds. This ensures that the client stays up to date with the quorum. When the client is unable to maintain this heartbeat for a significant amount of time (by default, roughly 30 seconds), the ZooKeeper client experiences what is called a session expiration.

The session expiration is the result of the ZooKeeper-server-side state being cleaned up, because the server assumes the client (HBase) is no longer interested in using ZooKeeper. This state is normally recovered automatically; however, it typically means that HBase is about to stop. Because so many operations inside HBase rely on ZooKeeper, it cannot keep running through a prolonged period in which it cannot access ZooKeeper.
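For reference, the session timeout in play here is negotiated between both sides: HBase requests it via zookeeper.session.timeout in hbase-site.xml, and the ZooKeeper quorum clamps that request to the range it derives from tickTime (or explicit minSessionTimeout/maxSessionTimeout) in zoo.cfg. A rough sketch with illustrative values, not recommendations:

<!-- hbase-site.xml on the HBase nodes: the session timeout HBase asks ZooKeeper for, in milliseconds -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>

# zoo.cfg on the ZooKeeper servers: sessions are capped at maxSessionTimeout,
# which defaults to 20 * tickTime when not set explicitly
tickTime=2000
maxSessionTimeout=90000

Note that raising the HBase value alone has no effect if the quorum's maxSessionTimeout is lower; the granted session timeout is whatever the server side allows.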

All of this typically occurs after an unexpected "pause" of the HBase service, caused by load on HBase, load on the machine running HBase, or a host of other issues. Investigate whether there were Java GC issues at the time that caused HBase to stop responding for a period.
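If a GC pause is the suspect, one low-risk way to confirm it is to enable GC logging for the RegionServer JVM and correlate long stop-the-world pauses with the session-expiry timestamps in the log above. A minimal sketch for hbase-env.sh, assuming a Java 7/8-era HotSpot JVM (the log path is only an example):

# hbase-env.sh (illustrative): log GC activity and total application-stopped time for the RegionServer JVM
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/hbase/gc-regionserver.log"

A "Total time for which application threads were stopped" entry longer than the effective session timeout, at or just before 23:06 on sddsvrwm383 in this case, would point to a JVM pause rather than ZooKeeper itself as the root cause.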

Explorer

Thanks Josh,

We suspect there could be two reasons for this issue:

1) The ZooKeeper session timeout is too low.

2) The GC needs tuning.

Could you please share some documentation on the above two points to help resolve this issue?