Created 01-21-2018 07:11 AM
The active HBase Master went down and failover made the standby HBase Master active. After some time, however, the region servers went down one by one, the new active HBase Master also went down, and finally the whole HBase cluster was offline. We have since restarted all the services and brought them back up.
Active HBase Master logs:
2018-01-17 16:22:29,895 ERROR [master/post-om2.vodafone.flytxt.com/10.88.8.79:16000] master.ActiveMasterManager: master:16000-0x35ee3bbff600001, quorum=post-om2.vodafone.flytxt.com:2181,post-om1.vodafone.flytxt.com:2181,post-os1.vodafone.flytxt.com:2181, baseZNode=/hbase Error deleting our own master address node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:745)
    at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:148)
    at org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:267)
    at org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:1145)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:744)
2018-01-17 16:22:29,895 INFO [master/post-om2.vodafone.flytxt.com/10.88.8.79:16000] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15ee3bbf9910003
2018-01-17 16:22:29,897 INFO [master/post-om2.vodafone.flytxt.com/10.88.8.79:16000] zookeeper.ZooKeeper: Session: 0x15ee3bbf9910003 closed
2018-01-17 16:22:29,897 INFO [post-om2:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
2018-01-17 16:22:29,899 INFO [master/post-om2.vodafone.flytxt.com/10.88.8.79:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2018-01-17 16:22:29,899 INFO [master/post-om2.vodafone.flytxt.com/10.88.8.79:16000] ipc.RpcServer: Stopping server on 16000
2018-01-17 16:22:29,900 INFO [RpcServer.listener,port=16000] ipc.RpcServer: RpcServer.listener,port=16000: stopping
New active (previously standby) HBase Master logs:
2018-01-17 16:23:06,025 INFO [post-om1:16000.activeMasterManager] master.ActiveMasterManager: Registered Active Master=post-om1.vodafone.flytxt.com,16000,1507059524710
2018-01-17 16:24:46,275 INFO [post-om1:16000.activeMasterManager] master.AssignmentManager: Joined the cluster in 119ms, failover=true

Logs on post-om1 when post-os10 went down:

2018-01-17 18:45:35,364 ERROR [PriorityRpcServer.handler=11,queue=1,port=16000] master.MasterRpcServices: Region server post-os10.vodafone.flytxt.com,16020,1507059587301 reported a fatal error:
ABORTING region server post-os10.vodafone.flytxt.com,16020,1507059587301: IOE in log roller
Cause:
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:365)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:746)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:711)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:137)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /apps/hbase/data/WALs/post-os10.vodafone.flytxt.com,16020,1507059587301
2018-01-17 18:54:21,800 ERROR [PriorityRpcServer.handler=11,queue=1,port=16000] master.MasterRpcServices: Region server post-os5.vodafone.flytxt.com,16020,1513481333569 reported a fatal error:
ABORTING region server post-os5.vodafone.flytxt.com,16020,1513481333569: IOE in log roller
Cause:
java.io.IOException: cannot get log writer
Sample region server logs:
2018-01-17 18:52:10,311 INFO [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-SendThread(post-om1.vodafone.flytxt.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server post-om1.vodafone.flytxt.com/10.88.8.71:2181. Will not attempt to authenticate using SASL (unknown error)
2018-01-17 18:52:10,311 INFO [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-SendThread(post-om1.vodafone.flytxt.com:2181)] zookeeper.ClientCnxn: Socket connection established to post-om1.vodafone.flytxt.com/10.88.8.71:2181, initiating session
2018-01-17 18:52:10,313 INFO [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-SendThread(post-om1.vodafone.flytxt.com:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15ee3bbf9933b39 has expired, closing socket connection
2018-01-17 18:52:10,313 WARN [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2018-01-17 18:52:10,313 INFO [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15ee3bbf9933b39
2018-01-17 18:52:10,313 INFO [regionserver/post-os5.vodafone.flytxt.com/10.88.8.76:16020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2018-01-17 18:52:10,335 INFO [main-SendThread(post-om2.vodafone.flytxt.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server post-om2.vodafone.flytxt.com/10.88.8.79:2181. Will not attempt to authenticate using SASL (unknown error)
Created 01-22-2018 10:36 AM
Check the HBase GC logs. If there is a GC pause that correlates with these log lines, then you have found your problem.
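For example, something along these lines can surface long pauses around the crash time (a rough sketch only; the log paths and file names are assumptions for a typical HDP-style install, adjust for yours):

    # HBase's JvmPauseMonitor logs a warning whenever the JVM stalls (GC or host pause)
    grep "Detected pause in JVM or host machine" /var/log/hbase/hbase-hbase-regionserver-post-os5*.log

    # If GC logging with -XX:+PrintGCApplicationStoppedTime is enabled, long stop-the-world
    # pauses show up here; check whether they line up with the 18:52:10 session expiry above
    grep "Total time for which application threads were stopped" /var/log/hbase/gc.log*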
While the master/RS is in a GC pause, it fails to send heartbeats to ZooKeeper and its ZooKeeper session expires. In that case you can increase the ZooKeeper session timeout and increase the tick time in the ZooKeeper config.
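A rough sketch of the relevant settings (the values below are just examples, not recommendations; note the effective timeout is negotiated with the ZooKeeper server, which by default caps it at 20 x tickTime, so both sides must allow the larger value):

    # hbase-site.xml (HBase side; default is 90000 ms)
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>

    # zoo.cfg (ZooKeeper servers)
    tickTime=6000
    maxSessionTimeout=120000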
This is not directly related, but do check out: https://superuser.blog/hbase-dead-regionserver/