Created 04-13-2017 12:59 PM
We upgraded from HDP 2.4 to 2.6. All services start, but the HBase RegionServers cannot communicate with the Master and fail with the following exception. Any thoughts? Nothing else has changed after the upgrade, and the /etc/hosts entries are intact.
2017-04-13 18:13:38,481 WARN [regionserver/172.16.3.197:16020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to abcdsolutions.local/172.16.3.196:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:10859)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2406)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:936)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to abcdsolutions.local/172.16.3.196:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1261)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1229)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
    ... 5 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
    at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1203)
    ... 6 more
Created 04-13-2017 03:58 PM
Have you performed a sanity check that the RegionServer which gave this error can connect to that IP+port? You can easily use telnet to perform this check (e.g. `telnet 172.16.3.196 16000`). If you get a connection refused error, either the HBase master is not running or there is a network issue preventing this node from talking to the Master.
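If telnet is not installed on the node, the same check can be scripted. The following is a minimal sketch, not an HBase tool: `can_connect` is a hypothetical helper name, and the 5-second default timeout is an arbitrary assumption. It only tests whether a TCP connection can be opened, which is all the telnet check does as well.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable or unreachable hosts.
        return False
```

Run from the RegionServer host, `can_connect("172.16.3.196", 16000)` would correspond to the telnet command above. A `True` result only shows that the port accepts connections; it does not show that the Master answers RPC calls in time.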
Created 04-17-2017 03:32 AM
Hi @Josh Elser, I am able to telnet to the master without any issues, and the master is running fine.
Created 04-17-2017 03:37 AM
Hi @Josh Elser, this is what I see in the HBase master logs:
master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 312811771 ms, expecting minimum of 1, maximum of 2147483647, timeout of 30000 ms, interval of 1500 ms.
Created 04-17-2017 02:33 PM
It sounds like there is no network issue between Master and RegionServer, so you would need to look at the HBase level instead. The error message you provided is only telling you that the Master has seen no RegionServers, but you already knew that because the error message in your question showed that the RegionServer failed to report to the Master. You need to figure out why this report is failing -- perhaps you should look at the DEBUG level logs.
Created 04-18-2017 11:43 AM
Hi @Josh Elser, this is what I could see in the logs in debug mode. I am seeing an IPC timeout exception.
The RPC timeout is set to around 15 minutes. Is there any other parameter that I should look into?
2017-04-18 10:41:51,628 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@6e4158de, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=null
2017-04-18 10:41:51,686 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: ClusterId : a6d99271-51ab-41ed-a2d5-93049dedd72d
2017-04-18 10:41:51,693 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initializing
2017-04-18 10:41:51,760 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/flush-table-proc/acquired already exists
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initialized
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initializing
2017-04-18 10:41:51,786 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/online-snapshot/acquired already exists
2017-04-18 10:41:51,789 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initialized
2017-04-18 10:41:51,793 WARN [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.lowerLimit is deprecated. Instead use hbase.regionserver.global.memstore.size.lower.limit
2017-04-18 10:41:51,795 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.MemStoreFlusher: globalMemStoreLimit=1.6 G, globalMemStoreLimitLowMark=1003.6 M, maxHeap=3.9 G
2017-04-18 10:41:51,800 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: CompactionChecker runs every 10sec
2017-04-18 10:41:51,818 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@2ea71264, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=abcd013.abcdsolutions.local/172.16.3.197:0
2017-04-18 10:41:51,823 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020
2017-04-18 10:41:51,870 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: System coprocessor loading is enabled
2017-04-18 10:41:51,870 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: Table coprocessor loading is enabled
2017-04-18 10:41:51,877 INFO [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: reportForDuty to master=abcd011.abcdsolutions.local,16000,1492492302253 with port=16020, startcode=1492492307927
2017-04-18 10:41:52,101 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2017-04-18 10:41:52,154 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Connecting to abcd011.abcdsolutions.local/172.16.3.195:16000
2017-04-18 10:42:02,024 WARN [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to abcd011.abcdsolutions.local/172.16.3.195:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=10001, operationTimeout=10000 expired.
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:10859)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2406)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:936)
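One detail worth pulling out of these logs is which Master address each attempt targets: the error in the original question names 172.16.3.196, while this DEBUG log names abcd011.abcdsolutions.local/172.16.3.195. Below is a small sketch for listing the distinct targets in a RegionServer log. `master_targets` is a hypothetical helper, and the regular expression assumes the `host/ip:port` form seen in the lines above.

```python
import re
from typing import Set, Tuple

# Matches "Connecting to host/ip:port" and "Call to host/ip:port" as printed in the logs.
TARGET_RE = re.compile(
    r"(?:Connecting to|Call to) (?P<host>[\w.-]+)/(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)"
)


def master_targets(log_text: str) -> Set[Tuple[str, str, int]]:
    """Collect the distinct (hostname, ip, port) targets mentioned in the log text."""
    return {
        (m.group("host"), m.group("ip"), int(m.group("port")))
        for m in TARGET_RE.finditer(log_text)
    }


# Two lines quoted from this thread, shortened to the relevant part.
sample = (
    "ipc.AbstractRpcClient: Connecting to abcd011.abcdsolutions.local/172.16.3.195:16000\n"
    "java.io.IOException: Call to abcdsolutions.local/172.16.3.196:16000 failed on local exception\n"
)
for host, ip, port in sorted(master_targets(sample)):
    print(host, ip, port)
```

More than one target in the output is not an error in itself on an HA setup, but it is a reminder to check connectivity and the Master logs for the address the RegionServer is currently using.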
Created 04-18-2017 11:30 AM
Hi @Josh Elser, I couldn't find anything in the debug logs.
Created 04-18-2017 11:30 AM
Hi @Josh Elser, what I can see from this error
failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=10001, operationTimeout=10000 expired
is a timeout exception. The RPC timeout property has this value:
<property>
  <name>hbase.rpc.timeout</name>
  <value>1800000</value>
</property>
Is there any other property that needs to be set? One more fact: we run HBase in HA mode, and the services were restarted after the upgrade, so the active HBase master changed after the restart.
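The numbers in this post can be compared directly: the exception reports operationTimeout=10000, while hbase.rpc.timeout is set to 1800000. The sketch below extracts both values so the mismatch is explicit. The helper names are hypothetical, and the `<configuration>` wrapper is an assumption based on the usual hbase-site.xml layout, since the thread only shows the single `<property>` element.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional


def operation_timeout_ms(error_text: str) -> Optional[int]:
    """Extract the operationTimeout value (ms) from a CallTimeoutException message."""
    match = re.search(r"operationTimeout=(\d+)", error_text)
    return int(match.group(1)) if match else None


def site_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a named property from hbase-site.xml style text."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Error text and property value as quoted in this thread.
error = ("org.apache.hadoop.hbase.ipc.CallTimeoutException: "
         "Call id=0, waitTime=10001, operationTimeout=10000 expired")
site = """<configuration>
  <property>
    <name>hbase.rpc.timeout</name>
    <value>1800000</value>
  </property>
</configuration>"""

observed = operation_timeout_ms(error)
configured = site_property(site, "hbase.rpc.timeout")
print(f"observed operationTimeout: {observed} ms")
print(f"configured hbase.rpc.timeout: {configured} ms")
```

Because the two values differ, hbase.rpc.timeout does not appear to be the setting that governs this particular call. HBase also has a separate `hbase.rpc.shortoperation.timeout` property whose default is 10000 ms; whether it applies here is an assumption that the thread does not confirm, and a longer timeout would not by itself explain why the Master fails to answer.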
Created 09-20-2017 04:57 PM
I'm currently having the exact same problem. I can telnet to the master but the region servers won't connect. This is so weird. We've been down for a few hours. Did anyone solve this?