HBase region servers not able to connect to master after upgrade

We upgraded from HDP 2.4 to 2.6. All services start, but the HBase region servers are not able to communicate with the master and fail with the following exception. Any thoughts? Nothing else has changed after the upgrade, and the /etc/hosts entries are intact.

2017-04-13 18:13:38,481 WARN  [regionserver/172.16.3.197:16020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to abcdsolutions.local/172.16.3.196:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:10859)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2406)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:936)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to abcdsolutions.local/172.16.3.196:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1261)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1229)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        ... 5 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=9, waitTime=10001, operationTimeout=10000 expired.
        at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1203)
        ... 6 more
8 REPLIES

Super Guru

Have you performed a sanity check that the RegionServer which gave this error can connect to that IP+port? You can easily use telnet to perform this check (e.g. `telnet 172.16.3.196 16000`). If you get a connection refused error, either the HBase master is not running or there is a network issue preventing this node from talking to the Master.
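
If telnet is not installed on the node, the same reachability check can be sketched in Python. This is only a minimal sketch: the helper name `can_connect` and the 5-second default timeout are illustrative assumptions, not anything from HBase.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, timeouts and DNS failures.
        return False
```

Run from the RegionServer host, `can_connect("172.16.3.196", 16000)` would test the master address and port shown in the log above. As with telnet, a `True` result only shows that the port accepts TCP connections; it says nothing about whether the master answers RPC calls in time.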

Hi @Josh Elser, I am able to telnet to the master without any issues. The master is also running fine.

Hi @Josh Elser, this is what I see in the HBase master logs:

master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 312811771 ms, expecting minimum of 1, maximum of 2147483647, timeout of 30000 ms, interval of 1500 ms.

Super Guru

It sounds like there is no network issue between Master and RegionServer, so you would need to look at the HBase level instead. The error message you provided is only telling you that the Master has seen no RegionServers, but you already knew that because the error message in your question showed that the RegionServer failed to report to the Master. You need to figure out why this report is failing -- perhaps you should look at the DEBUG level logs.

Hi @Josh Elser, this is what I could see in the logs in DEBUG mode. I am seeing an IPC timeout exception.

The RPC timeout is set to around 15 minutes. Is there any other parameter that I should look into?

2017-04-18 10:41:51,628 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@6e4158de, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=null
2017-04-18 10:41:51,686 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: ClusterId : a6d99271-51ab-41ed-a2d5-93049dedd72d
2017-04-18 10:41:51,693 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initializing
2017-04-18 10:41:51,760 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/flush-table-proc/acquired already exists
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initialized
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initializing
2017-04-18 10:41:51,786 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/online-snapshot/acquired already exists
2017-04-18 10:41:51,789 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initialized
2017-04-18 10:41:51,793 WARN  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.lowerLimit is deprecated. Instead use hbase.regionserver.global.memstore.size.lower.limit
2017-04-18 10:41:51,795 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.MemStoreFlusher: globalMemStoreLimit=1.6 G, globalMemStoreLimitLowMark=1003.6 M, maxHeap=3.9 G
2017-04-18 10:41:51,800 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: CompactionChecker runs every 10sec
2017-04-18 10:41:51,818 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@2ea71264, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=abcd013.abcdsolutions.local/172.16.3.197:0
2017-04-18 10:41:51,823 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020
2017-04-18 10:41:51,870 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: System coprocessor loading is enabled
2017-04-18 10:41:51,870 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: Table coprocessor loading is enabled
2017-04-18 10:41:51,877 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: reportForDuty to master=abcd011.abcdsolutions.local,16000,1492492302253 with port=16020, startcode=1492492307927
2017-04-18 10:41:52,101 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2017-04-18 10:41:52,154 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Connecting to abcd011.abcdsolutions.local/172.16.3.195:16000
2017-04-18 10:42:02,024 WARN  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to abcd011.abcdsolutions.local/172.16.3.195:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=10001, operationTimeout=10000 expired.
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:10859)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2406)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:936)

Hi @Josh Elser, I couldn't find anything in the DEBUG logs:

2017-04-18 10:41:51,628 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@6e4158de, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=null
2017-04-18 10:41:51,686 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: ClusterId : a6d99271-51ab-41ed-a2d5-93049dedd72d
2017-04-18 10:41:51,693 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initializing
2017-04-18 10:41:51,760 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/flush-table-proc/acquired already exists
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure flush-table-proc is initialized
2017-04-18 10:41:51,782 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initializing
2017-04-18 10:41:51,786 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] zookeeper.RecoverableZooKeeper: Node /hbase-unsecure/online-snapshot/acquired already exists
2017-04-18 10:41:51,789 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] procedure.RegionServerProcedureManagerHost: Procedure online-snapshot is initialized
2017-04-18 10:41:51,793 WARN  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.lowerLimit is deprecated. Instead use hbase.regionserver.global.memstore.size.lower.limit
2017-04-18 10:41:51,795 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.MemStoreFlusher: globalMemStoreLimit=1.6 G, globalMemStoreLimitLowMark=1003.6 M, maxHeap=3.9 G
2017-04-18 10:41:51,800 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: CompactionChecker runs every 10sec
2017-04-18 10:41:51,818 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@2ea71264, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=abcd013.abcdsolutions.local/172.16.3.197:0
2017-04-18 10:41:51,823 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020
2017-04-18 10:41:51,870 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: System coprocessor loading is enabled
2017-04-18 10:41:51,870 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.RegionServerCoprocessorHost: Table coprocessor loading is enabled
2017-04-18 10:41:51,877 INFO  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: reportForDuty to master=abcd011.abcdsolutions.local,16000,1492492302253 with port=16020, startcode=1492492307927
2017-04-18 10:41:52,101 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2017-04-18 10:41:52,154 DEBUG [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] ipc.AbstractRpcClient: Connecting to abcd011.abcdsolutions.local/172.16.3.195:16000
2017-04-18 10:42:02,024 WARN  [regionserver/abcd013.abcdsolutions.local/172.16.3.197:16020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to abcd011.abcdsolutions.local/172.16.3.195:16000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=10001, operationTimeout=10000 expired.
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:10859)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2406)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:936)
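
To see how long the RegionServer waited before giving up, the timestamps of two log lines can be subtracted. The following is a minimal sketch; the helper name `elapsed_seconds` is hypothetical, and it assumes every line starts with a 23-character timestamp in the format shown in this log.

```python
from datetime import datetime

# Timestamp layout used by the log above, e.g. "2017-04-18 10:41:52,154".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start_line, end_line):
    """Return the seconds between the timestamps that prefix two log lines."""
    # The timestamp occupies the first 23 characters of each line.
    start = datetime.strptime(start_line[:23], LOG_TS_FORMAT)
    end = datetime.strptime(end_line[:23], LOG_TS_FORMAT)
    return (end - start).total_seconds()
```

Applied to the "Connecting to" line (10:41:52,154) and the "error telling master we are up" line (10:42:02,024), this gives about 9.87 seconds, which is in line with the `operationTimeout=10000` reported in the exception: the connection attempt starts, and the call then expires roughly ten seconds later.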

Hi @Josh Elser, what I can see from this error

failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=10001, operationTimeout=10000 expired

is that it is a timeout exception. The RPC timeout property has this value:

<property>
  <name>hbase.rpc.timeout</name>
  <value>1800000</value>
</property>

Is there any other property that needs to be set? One more fact: we run HBase in HA mode, and the service was restarted after the upgrade, so the active HBase master changed after the restart.
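
A quick comparison of the two numbers quoted in this post is sketched below. The constant and function names are hypothetical; the values are copied from the error line and the property shown above.

```python
import re

# Value of hbase.rpc.timeout quoted in the post above, in milliseconds.
CONFIGURED_RPC_TIMEOUT_MS = 1800000

# Error text quoted in the post above.
ERROR_LINE = (
    "failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: "
    "Call id=0, waitTime=10001, operationTimeout=10000 expired"
)


def operation_timeout_ms(message):
    """Extract the operationTimeout value (in ms) from a CallTimeoutException message."""
    match = re.search(r"operationTimeout=(\d+)", message)
    if match is None:
        raise ValueError("no operationTimeout field in message")
    return int(match.group(1))


observed = operation_timeout_ms(ERROR_LINE)
print(f"configured hbase.rpc.timeout: {CONFIGURED_RPC_TIMEOUT_MS / 60000:.0f} min")
print(f"timeout on the failing call:  {observed / 1000:.0f} s")
print("same value" if observed == CONFIGURED_RPC_TIMEOUT_MS else "different value")
```

The configured value works out to 30 minutes, while the failing call expired after 10 seconds, so this call does not appear to be governed by `hbase.rpc.timeout`. One candidate worth checking is `hbase.rpc.shortoperation.timeout`, whose default is 10000 ms in HBase 1.x, but that is an assumption and is not confirmed anywhere in this thread.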

Contributor

I'm currently having the exact same problem. I can telnet to the master but the region servers won't connect. This is so weird. We've been down for a few hours. Did anyone solve this?