Created 06-14-2017 02:58 AM
Log file:
2017-06-12 23:06:29,603 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 7 attempts
2017-06-12 23:06:29,603 WARN [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] zookeeper.ZKUtil: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat. com:2181,sddsvrwm367.scglobaluat.aduat. com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Unable to list children of znode /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)
at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)
at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)
at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)
at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-06-12 23:06:29,604 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x35c93984f940fd2, quorum=sddsvrwm369.scglobaluat.aduat.com:2181,sddsvrwm367.scglobaluat.aduat.com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/replication/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:455)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:483)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1462)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1384)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1266)
at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:196)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:302)
at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:202)
at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:194)
at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2163)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-06-12 23:06:29,605 INFO [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] ipc.RpcServer: Stopping server on 16020
2017-06-12 23:06:29,606 INFO [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] token.AuthenticationTokenSecretManager: Stopping leader election, because: SecretManager stopping
2017-06-12 23:06:29,607 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping
2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2017-06-12 23:06:29,614 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2017-06-12 23:06:29,619 WARN [regionserver/sddsvrwm383.scglobaluat.aduat.com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat. com:2181,sddsvrwm367.scglobaluat.aduat. com:2181,sddsvrwm368.scglobaluat.aduat. com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
2017-06-12 23:07:29,625 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-06-12 23:07:29,625 INFO [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.
2017-06-12 23:07:29,628 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-06-12 23:07:29,628 INFO [pool-684-thread-1] timeline.HadoopTimelineMetricsSink: Closing HadoopTimelineMetricSink. Flushing metrics to collector...
2017-06-12 23:07:30,132 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Initializing Timeline metrics sink.
2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Identified hostname = sddsvrwm383.scglobaluat.aduat.com, serviceName = hbase
2017-06-12 23:07:30,148 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Collector Uri: http://sddsvrwm368.scglobaluat.aduat. com:6188/ws/v1/timeline/metrics
2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-06-12 23:07:30,151 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2017-06-12 23:07:32,620 WARN [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat. com:2181,sddsvrwm367.scglobaluat.aduat. com:2181,sddsvrwm368.scglobaluat.aduat.com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=sddsvrwm369.scglobaluat.aduat. com:2181,sddsvrwm367.scglobaluat.aduat. com:2181,sddsvrwm368.scglobaluat.aduat. com:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
2017-06-12 23:08:36,621 ERROR [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 7 attempts
2017-06-12 23:08:36,621 WARN [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-secure/rs/sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1211)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1427)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1098)
at java.lang.Thread.run(Thread.java:745)
2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] regionserver.HRegionServer: stopping server sddsvrwm383.scglobaluat.aduat. com,16020,1497136332052; zookeeper connection closed.
2017-06-12 23:08:36,622 INFO [regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020] regionserver.HRegionServer: regionserver/sddsvrwm383.scglobaluat.aduat. com/172.25.12.67:16020 exiting
2017-06-12 23:08:36,627 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2681)
2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: ==> JVMShutdownHook.run()
2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Signalling async audit cleanup to start.
2017-06-12 23:08:36,633 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Waiting up to 30 seconds for audit cleanup to finish.
2017-06-12 23:08:36,634 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Starting cleanup
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Stop called. name=hbaseRegional.async
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] queue.AuditAsyncQueue: Interrupting consumerThread. name=hbaseRegional.async, consumer=hbaseRegional.async.summary
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Done cleanup
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Waiting to audit cleanup start signal
2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Audit cleanup finished after 2 milli seconds
2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: JVMShutdownHook: Interrupting ranger async audit cleanup thread
2017-06-12 23:08:36,635 INFO [pool-5-thread-1] provider.AuditProviderFactory: <== JVMShutdownHook.run()
2017-06-12 23:08:36,635 INFO [Ranger async Audit cleanup] provider.AuditProviderFactory: RangerAsyncAuditCleanup: Interrupted while waiting for audit startCleanup signal! Exiting the thread...
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.apache.ranger.audit.provider.AuditProviderFactory$RangerAsyncAuditCleanup.run(AuditProviderFactory.java:487)
at java.lang.Thread.run(Thread.java:745)
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Caught exception in consumer thread. Shutdown might be in progress
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting polling loop. name=hbaseRegional.async
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Calling to stop consumer. name=hbaseRegional.async, consumer.name=hbaseRegional.async.summary
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Stop called. name=hbaseRegional.async.summary
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditSummaryQueue: Interrupting consumerThread. name=hbaseRegional.async.summary, consumer=hbaseRegional.async.summary.batch
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditAsyncQueue0] queue.AuditAsyncQueue: Exiting consumerThread.run() method. name=hbaseRegional.async
2017-06-12 23:08:36,636 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Caught exception in consumer thread. Shutdown might be in progress
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting polling loop. name=hbaseRegional.async.summary
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Calling to stop consumer. name=hbaseRegional.async.summary, consumer.name=hbaseRegional.async.summary.batch
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Stop called. name=hbaseRegional.async.summary.batch
2017-06-12 23:08:36,637 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@4f3faa70
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] destination.HDFSAuditDestination: Flush HDFS audit logs completed.....
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditBatchQueue: Interrupting consumerThread. name=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditSummaryQueue0] queue.AuditSummaryQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary
name=hbaseRegional.async.summary
2017-06-12 23:08:36,638 INFO [pool-5-thread-1] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2017-06-12 23:08:36,637 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Caught exception in consumer thread. Shutdown might be in progress
2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread. Queue=hbaseRegional.async.summary.batch, dest=hbaseRegional.async.summary.batch.hdfs
2017-06-12 23:08:36,638 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Calling to stop consumer. name=hbaseRegional.async.summary.batch, consumer.name=hbaseRegional.async.summary.batch.hdfs
2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] provider.BaseAuditHandler: Audit Status Log: name=hbaseRegional.async.summary.batch.hdfs, interval=04:18.058 minutes, events=2, succcessCount=2, totalEvents=12, totalSuccessCount=12
2017-06-12 23:08:36,651 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditFileSpool: Stop called, queueName=hbaseRegional.async.summary.batch, consumer=hbaseRegional.async.summary.batch.hdfs
2017-06-12 23:08:36,652 INFO [org.apache.ranger.audit.queue.AuditBatchQueue0] queue.AuditBatchQueue: Exiting consumerThread.run() method. name=hbaseRegional.async.summary.batch
2017-06-12 23:08:36,651 INFO [hbaseRegional.async.summary.batch_hbaseRegional.async.summary.batch.hdfs_destWriter] queue.AuditFileSpool: Caught exception in consumer thread. Shutdown might be in progress
2017-06-12 23:08:36,655 INFO [pool-5-thread-1] regionserver.ShutdownHook: Shutdown hook finished.
Created 06-14-2017 03:39 PM
ZooKeeper clients maintain a state machine of connected-ness to the quorum. This state machine is handled in the background as HBase (as a client) interacts with ZooKeeper.
Under normal conditions, ZooKeeper clients (HBase services like RegionServer and Master) maintain a "heartbeat" (an RPC) to ZooKeeper every few seconds. This ensures that the client is up to date with the quorum information. When the client is unable to maintain this heartbeat for a significant amount of time (by default, ~30seconds), the ZooKeeper client experiences what is called a session expiration.
The session expiration is a result of the ZK-server-side state being cleaned up because it is assumed that the client (HBase) stopped being interested in using ZooKeeper. This state is normally recoverable automatically; however it typically means that HBase is about to stop. Because there are so many operations inside of HBase which rely on using ZooKeeper, it cannot run if it has a prolonged period of time where it cannot access ZooKeeper.
All of this typically occurs after a unexpected "pause" of the HBase service, due to load on HBase, the machine running HBase, or a host of other issues. Investigate if there were Java GC issues at the time which caused HBase to stop responding for a certain amount of time.
Created 06-16-2017 01:28 PM
Thanks Josh,
We suspect there could be two reasons for this issue;
1)Zookeeper session timeout is less
2)Need to Tune GC
Could you please share some documents on above two to resolve this issue?
Created 06-16-2017 06:53 PM