Created 05-04-2017 08:48 AM
Almost all RegionServers in our production cluster are dying with a LeaseExpiredException followed by a YouAreDeadException. The HBase version in use is 1.1.2.
The logs are given below:
2017-04-28 11:25:10,793 ERROR [sync.2] wal.FSHLog: Error syncing, request close of wal
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/WALs/node5,16020,1487446061404-splitting/node5%2C16020%2C1487446061404.default.1493358867035 (inode 4423489): File is not open for writing. [Lease. Holder: DFSClient_NONMAPREDUCE_1688735303_1, pendingcreates: 3]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
    at org.apache.hadoop.ipc.Client.call(Client.java:1427)
    at org.apache.hadoop.ipc.Client.call(Client.java:1358)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy19.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy20.getAdditionalDatanode(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
    at com.sun.proxy.$Proxy21.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
2017-04-28 11:25:10,793 WARN [B.defaultRpcServer.handler=17,queue=2,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":41472,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"162.168.6.235:60011","starttimems":1493358869320,"queuetimems":0,"class":"HRegionServer","responsesize":416,"method":"Scan"}
2017-04-28 11:25:10,793 WARN [B.defaultRpcServer.handler=16,queue=1,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":41495,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"162.168.6.234:48579","starttimems":1493358869298,"queuetimems":0,"class":"HRegionServer","responsesize":416,"method":"Scan"}
2017-04-28 11:25:10,795 WARN [B.defaultRpcServer.handler=2,queue=2,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":41695,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"162.168.6.235:60011","starttimems":1493358869100,"queuetimems":0,"class":"HRegionServer","responsesize":416,"method":"Scan"}
2017-04-28 11:25:10,802 WARN [sync.4] hdfs.DFSClient: Error while syncing
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/WALs/node5,16020,1487446061404-splitting/node5%2C16020%2C1487446061404.default.1493358867035 (inode 4423489): File is not open for writing. [Lease. Holder: DFSClient_NONMAPREDUCE_1688735303_1, pendingcreates: 3]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
2017-04-28 11:25:10,965 FATAL [regionserver/node5/162.168.6.235:16020] regionserver.HRegionServer: ABORTING region server node5,16020,1487446061404: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing node5,16020,1487446061404 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:413)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:318)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:295)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:744)
2017-04-28 11:25:10,951 WARN [B.defaultRpcServer.handler=29,queue=2,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":41653,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"162.168.6.235:59900","starttimems":1493358869298,"queuetimems":0,"class":"HRegionServer","responsesize":416,"method":"Scan"}
2017-04-28 11:25:10,951 WARN [B.defaultRpcServer.handler=19,queue=1,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":41820,"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","client":"162.168.6.229:52499","starttimems":1493358869131,"queuetimems":0,"class":"HRegionServer","responsesize":3369,"method":"Get"}
2017-04-28 11:25:10,965 FATAL [regionserver/node5/162.168.6.235:16020] regionserver.HRegionServer: ABORTING region server node5,16020,1487446061404: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing node5,16020,1487446061404 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:413)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:318)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:295)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:744)
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing node5,16020,1487446061404 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:413)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:318)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:295)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:744)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:325)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1141)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:949)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing node5,16020,1487446061404 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:413)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:318)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:295)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
Created 05-04-2017 04:14 PM
Check the Master log -- the Master is the entity that believes this RegionServer should be dead.
Most likely, the HBase Master received a notification from ZooKeeper that this RegionServer had lost its ZooKeeper node, so the Master marked the RegionServer as dead and started re-assigning its regions. However, the RegionServer had not yet realized that it had lost its own lock (a classic distributed-systems problem) and kept trying to perform actions.
I would bet that you will see a message in the RS log about losing its ZK lock or a SessionExpiration. Commonly, this happens due to a JVM pause from garbage collection. All of this information will be captured in the logs -- you will need to read them to confirm that this is what happened.
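If it helps, here is a rough sketch of how you might look for that evidence. The log path below assumes a typical HDP-style layout and the search strings are the usual patterns; both are assumptions, so adjust them to your environment:

# Assumed RegionServer log path -- adjust to your install.
RS_LOG=/var/log/hbase/hbase-hbase-regionserver-node5.log

# Did the ZooKeeper session expire?
grep -i 'sessionexpired\|session expired' "$RS_LOG"

# Were there long JVM/GC pauses (logged by the JvmPauseMonitor thread)?
grep -i 'JvmPauseMonitor\|Detected pause in JVM' "$RS_LOG"

If those searches turn up hits just before 2017-04-28 11:25, a GC pause longer than the ZooKeeper session timeout is the most likely trigger.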
Created 05-29-2017 07:05 AM
Adding to @Josh Elser's answer: if you choose to increase the ZooKeeper session timeout, setting the value only in the HBase configuration won't work. ZooKeeper takes its own tickTime into account when calculating the maximum allowed session timeout (tickTime * 20), so the tickTime has to be increased as well. I was facing the same problem and later wrote about it here: https://superuser.blog/hbase-dead-regionserver/
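To make the cap concrete with illustrative numbers (assuming the common ZooKeeper default tickTime of 2000 ms): the server allows at most 20 * 2000 ms = 40 000 ms, so requesting, say, zookeeper.session.timeout=90000 from the HBase side alone still gets negotiated down to 40 s. To actually get a 90 s session, the ZooKeeper tickTime would need to be at least 4500 ms.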
Created 05-29-2017 04:47 PM
What you pasted here is a little unclear, so let me try to clarify.
The *maximum* allowed ZooKeeper session timeout is defined to be twenty times the ZK tickTime. By default on HDP, that would be a maximum allowed session timeout of 40s. If you feel that you must increase this timeout further, you would have to increase the tickTime as well.
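As a rough sketch of what that looks like in practice -- the values below are illustrative assumptions, not recommendations -- the tickTime lives in zoo.cfg on the ZooKeeper servers, while the timeout HBase requests lives in hbase-site.xml, and the two must line up for a longer timeout to take effect:

# zoo.cfg on each ZooKeeper server (illustrative value)
tickTime=3000    # raises the default cap of 20 * tickTime to 60000 ms

<!-- hbase-site.xml on each RegionServer (illustrative value) -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>  <!-- anything above 20 * tickTime is negotiated down -->
</property>

Both ZooKeeper and the RegionServers need a restart for such a change to take effect.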
Users should *not* do this without considering the consequences. Increasing the ZooKeeper tickTime will increase the amount of time ZooKeeper takes to process and distribute notifications. In other words, it will increase the latency of applications using ZooKeeper.
Increasing the ZooKeeper session timeout and tickTime to large values indicates that something else is likely wrong with your system and that the session timeout is being used as a bandaid to hide another problem. The default 30s timeout should be more than enough for most systems to operate happily.
Created 05-30-2017 08:04 AM
Copied from the ZooKeeper docs: "tickTime: the length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds. It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be two ticks." So the minimum session timeout is twice the tickTime and the maximum is 20 times the tickTime. And yes, I agree that the default 40s should suffice for most systems, but in exceptional cases one needs to increase it (as also recommended in the HBase book: http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired).
Created 05-30-2017 01:57 PM
Yes, you are correct. I meant twenty times, not two.