Support Questions

I am running a 12-node cluster with an HBase Master and 12 RegionServers. Can anyone give me an exact reason, an RCA, or a suggestion for why the HBase RegionServer went down? Also, is the ZooKeeper client mandatory on all of the HBase RegionServers or not?

Log file:

2017-06-12 23:06:29,603 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 7 attempts

2017-06-12 23:06:29,603 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.ZKUtil: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Unable to list children of znode /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)

at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)

at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)

at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)

at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:06:29,604 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)

at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)

at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)

at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)

at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:06:29,605 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] ipc.RpcServer: Stopping server on 16020

2017-06-12 23:06:29,606 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] token.AuthenticationTokenSecretManager: Stopping leader election, because: SecretManager stopping

2017-06-12 23:06:29,607 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping

2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped

2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping

2017-06-12 23:06:29,619 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:07:29,625 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...

2017-06-12 23:07:29,625 INFO [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.

2017-06-12 23:07:29,628 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.

2017-06-12 23:07:29,628 INFO [pool-684-thread-1] timeline.HadoopTimelineMetricsSink: Closing HadoopTimelineMetricSink. Flushing metrics to collector...

2017-06-12 23:07:30,132 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Initializing Timeline metrics sink.

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Identified hostname = sddsvrwm383.scglobaluat.aduat.com, serviceName = hbase

2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Collector Uri: http://sddsvrwm368.scglobaluat.aduat.com:6188/ws/v1/timeline/metrics

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).

2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started

2017-06-12 23:07:32,620 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

2017-06-12 23:08:36,621 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 7 attempts

2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: Failed deleting my ephemeral node

org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052

at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)

at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1222)

at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1211)

at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1427)

at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1098)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: stopping server sddsvrwm383.scglobaluat.aduat.com,16020,1497136332052; zookeeper connection closed.

2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] regionserver.HRegionServer: regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020 exiting

2017-06-12 23:08:36,627 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting

java.lang.RuntimeException: HRegionServer Aborted

at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)

at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)

at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2681)

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: ==> JVMShutdownHook.run()

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Signalling async audit cleanup to start.

2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.

2017-06-12 23:08:36,634 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Starting cleanup

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Stop called. name=hbaseRegional.async

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Interrupting consumerThread. name=hbaseRegional.async, consumer=hbaseRegional.async.summary

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Done cleanup

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Waiting to audit cleanup start signal

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Audit cleanup finished after 2 milli seconds

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Interrupting ranger async audit cleanup thread

2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: <== JVMShutdownHook.run()

2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Interrupted while waiting for audit startCleanup signal! Exiting the thread...

java.lang.InterruptedException

at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)

at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)

at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)

at org.apache.ranger.audit.provider.AuditProviderFactory$RangerAsyncAuditCleanup.run(AuditProviderFactory.java:487)

at java.lang.Thread.run(Thread.java:745)

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting polling loop. name=hbaseRegional.async

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Calling to stop consumer. name=hbaseRegional.async, consumer.name=hbaseRegional.async.summary

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Stop called. name=hbaseRegional.async.summary

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Interrupting consumerThread. name=hbaseRegional.async.summary, consumer=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting consumerThread.run() method. name=hbaseRegional.async

2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting polling loop. name=hbaseRegional.async.summary

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Calling to stop consumer. name=hbaseRegional.async.summary, consumer.name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Stop called. name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,637 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@4f3faa70

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Interrupting consumerThread. name=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary

2017-06-12 23:08:36,638 INFO [pool-5-thread-1] regionserver.ShutdownHook: Starting fs shutdown hook thread.

2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread. Queue=hbaseRegional.async.summary.batch, dest=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Calling to stop consumer. name=hbaseRegional.async.summary.batch, consumer.name=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] provider.BaseAuditHandler: Audit Status Log: name=hbaseRegional.async.summary.batch.hdfs, interval=04:18.058 minutes, events=2, succcessCount=2, totalEvents=12, totalSuccessCount=12

2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditFileSpool: Stop called, queueName=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs

2017-06-12 23:08:36,652 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary.batch

2017-06-12 23:08:36,651 INFO [hbaseRegional.async.summary.batch_hbaseRegional.async.summary.batch.hdfs_destWriter] queue.AuditFileSpool: Caught exception in consumer thread. Shutdown might be in progress

2017-06-12 23:08:36,655 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook finished.

3 Replies

ZooKeeper clients maintain a state machine of connectedness to the quorum. This state machine is handled in the background as HBase (as a client) interacts with ZooKeeper.

Under normal conditions, ZooKeeper clients (HBase services such as the RegionServer and Master) maintain a "heartbeat" (an RPC) to ZooKeeper every few seconds. This ensures that the client stays up to date with the quorum. When the client is unable to maintain this heartbeat for a significant amount of time (by default, roughly 30 seconds), the ZooKeeper client experiences what is called a session expiration.
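
A quick way to rule out basic connectivity problems between the RegionServer and the quorum is the ZooKeeper four-letter-word commands. A minimal sketch, run from the affected RegionServer host (the hostnames are taken from the quorum string in your log; adjust as needed):

# "ruok" should answer "imok"; "stat" prints latency and connection counts for each ZooKeeper server.
for zk in sddsvrwm367 sddsvrwm368 sddsvrwm369; do
  echo "--- $zk ---"
  echo ruok | nc $zk.scglobaluat.aduat.com 2181; echo
  echo stat | nc $zk.scglobaluat.aduat.com 2181 | head -n 5
done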

The session expiration is a result of the ZooKeeper-server-side state being cleaned up, because it is assumed that the client (HBase) is no longer interested in using ZooKeeper. This state is normally recoverable automatically; however, it typically means that HBase is about to stop. Because so many operations inside HBase rely on ZooKeeper, it cannot keep running through a prolonged period in which it cannot access ZooKeeper.
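
For reference, the timeout involved here is configured on the HBase side with zookeeper.session.timeout in hbase-site.xml, and the value the client is actually granted is capped by the ZooKeeper server's own session-timeout limits. A minimal sketch with an illustrative value only; raising the timeout hides the symptom rather than fixing the underlying pause:

<!-- hbase-site.xml, on the Master and every RegionServer -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- milliseconds; the ZooKeeper quorum clamps this to its own min/max session timeout -->
  <value>90000</value>
</property>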

All of this typically occurs after an unexpected "pause" of the HBase service, due to load on HBase, load on the machine running HBase, or a host of other issues. Investigate whether there were Java GC pauses at the time that caused HBase to stop responding for a stretch of time.
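
To check the GC angle, enabling GC logging on the RegionServers makes long pauses easy to spot. A minimal sketch for hbase-env.sh, assuming a Java 7/8 JVM (the flags differ on Java 9+) and a log directory that exists on the host:

# hbase-env.sh -- append GC logging to the RegionServer JVM options
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/gc-regionserver.log"

Any stop-the-world pause in that log that approaches or exceeds the ZooKeeper session timeout lines up with the SessionExpiredException above.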


Thanks, Josh.

We suspect there could be two reasons for this issue:

1) The ZooKeeper session timeout is too low.

2) The GC needs tuning.

Could you please share some documentation on the above two points to help resolve this issue?
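
Before tuning either knob, it is worth confirming both suspicions from the existing logs. A rough sketch, assuming the usual HDP log location and that your HBase version includes the JvmPauseMonitor; the exact messages and paths can vary by version:

# Did the RegionServer detect long JVM/host pauses around the time of the crash?
grep JvmPauseMonitor /var/log/hbase/hbase-hbase-regionserver-sddsvrwm383*.log* | grep "2017-06-12 23:0"

# What session timeout did the ZooKeeper client actually negotiate?
grep -i "negotiated timeout" /var/log/hbase/hbase-hbase-regionserver-sddsvrwm383*.log* | tail -n 5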
