Created 11-11-2016 07:56 PM
After a cluster restart, HBase hangs and then crashes a few minutes after it starts. Here are some of the errors from the master log.
2016-11-11 13:40:25,607 WARN [master:xxxx:60000] client.ScannerCallable: Ignore, probably already closed
java.io.IOException: Call to cnhd003/10.56.200.113:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=7, waitTime=60042, rpcTimeout=60000
    at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:29264)
    at org.apache.hadoop.hbase.client.ScannerCallable.close(ScannerCallable.java:285)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:153)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:57)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:121)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:97)
    at org.apache.hadoop.hbase.client.ClientScanner.close(ClientScanner.java:431)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:605)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScanOfMeta(MetaReader.java:139)
    at org.apache.hadoop.hbase.catalog.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:164)
    at org.apache.hadoop.hbase.catalog.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:131)
    at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:895)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:609)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=7, waitTime=60042, rpcTimeout=60000
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
2016-11-11 11:58:58,072 FATAL [master:xxxx:60000] master.HMaster: Master server abort: loaded coprocessors are: []
2016-11-11 11:58:58,072 FATAL [master:xxxx:60000] master.HMaster: Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:384)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:599)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScanOfMeta(MetaReader.java:139)
    at org.apache.hadoop.hbase.catalog.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:164)
    at org.apache.hadoop.hbase.catalog.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:131)
    at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:895)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:609)
    at java.lang.Thread.run(Thread.java:744)
Created 11-16-2016 02:36 PM
I was able to fix the problem by going into the HBase ZooKeeper CLI (hbase zkcli) and deleting the data under /hbase. Once that was deleted, the HBase master and region servers started correctly.
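For anyone following along, here is a rough sketch of what that cleanup might look like. It is not a record of the exact commands that were run. It assumes the znode parent really is /hbase (check zookeeper.znode.parent in hbase-site.xml; some installs use a different path), that the bundled ZooKeeper CLI is a 3.4-era one where the recursive delete command is rmr (newer CLIs use deleteall), and that HBase is stopped first. The function name zk_cleanup_commands and the ZNODE_PARENT variable are made up for this example. The script only prints the commands; it does not touch ZooKeeper.

```shell
#!/usr/bin/env bash
# Sketch only: prints the ZooKeeper CLI commands for clearing HBase's znodes.
# Assumption: the znode parent is /hbase unless ZNODE_PARENT says otherwise.
ZNODE_PARENT="${ZNODE_PARENT:-/hbase}"

# Hypothetical helper: emit the commands to type at the `hbase zkcli` prompt.
zk_cleanup_commands() {
  # Look at what is stored under the parent znode before removing anything.
  printf 'ls %s\n' "$1"
  # Recursive delete in the ZooKeeper 3.4 CLI (assumption: `deleteall` on newer CLIs).
  printf 'rmr %s\n' "$1"
  printf 'quit\n'
}

# Dry run: show the commands so they can be reviewed and entered by hand.
zk_cleanup_commands "$ZNODE_PARENT"
```

With HBase stopped, start the CLI with hbase zkcli, enter the printed commands at the prompt, then start the master and region servers again. HBase recreates its znodes on startup.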
Created 11-11-2016 07:57 PM
Check the state of the RegionServer before the Master reports this error. The Master is trying to read the hbase:meta table and failing. This should be a very fast operation.
Created 11-11-2016 08:06 PM
This is the error log from one of the region servers. It seems to work for a few minutes, then the master crashes.
2016-11-11 14:01:53,052 DEBUG [RpcServer.responder] ipc.RpcServer: RpcServer.responder: checking for old call responses.
2016-11-11 14:03:53,143 DEBUG [RpcServer.responder] ipc.RpcServer: RpcServer.responder: checking for old call responses.
2016-11-11 14:04:59,535 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry
2016-11-11 14:05:02,537 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry
Created 11-11-2016 08:54 PM
Hopefully this is a bit more helpful. It looks like the region server is compacting hbase:meta until the master crashes.
a/data/hbase/meta/1588230740/info/5853cea44784447a9d2fa9a08ac5c24d, keycount=1251, bloomtype=NONE, size=178.2 K, encoding=NONE, seqNum=11413
2016-11-11 14:47:24,192 DEBUG [regionserver60020-smallCompactions-1478895907565] compactions.Compactor: Compacting hdfs://xxxx-pvt.phibred.com:8020/apps/hbase/data/data/hbase/meta/1588230740/info/5a9e970b57f848469bb16ddf9bfa10f4, keycount=1249, bloomtype=NONE, size=178.0 K, encoding=NONE, seqNum=11413
2016-11-11 14:47:24,193 DEBUG [regionserver60020-smallCompactions-1478895907565] compactions.Compactor: Compacting hdfs://xxxx-pvt.phibred.com:8020/apps/hbase/data/data/hbase/meta/1588230740/info/b8484dfcd6d54176bba35f22fe76eb7c, keycount=1230, bloomtype=NONE, size=176.5 K, encoding=NONE, seqNum=11413
2016-11-11 14:47:24,193 DEBUG [regionserver60020-smallCompactions-1478895907565] compactions.Compactor: Compacting hdfs://xxxx-pvt.phibred.com:8020/apps/hbase/data/data/hbase/meta/1588230740/info/18cca3c929704887a0a5ec8a6920a7d0, keycount=1229, bloomtype=NONE, size=176.4 K, encoding=NONE, seqNum=11413
2016-11-11 14:47:24,239 INFO [regionserver60020-smallCompactions-1478895907565] regionserver.StoreFile$Reader: Loaded Delete Family Bloom (CompoundBloomFilter) metadata for 0537bf31a4c942bbb8caaa0c5cac2148
2016-11-11 14:47:24,269 DEBUG [regionserver60020-smallCompactions-1478895907565] regionserver.HRegionFileSystem: Committing store file hdfs://xxxx-pvt.phibred.com:8020/apps/hbase/data/data/hbase/meta/1588230740/.tmp/0537bf31a4c942bbb8caaa0c5cac2148 as hdfs://xxxx-pvt.phibred.com:8020/apps/hbase/data/data/hbase/meta/1588230740/info/0537bf31a4c942bbb8caaa0c5cac2148
2016-11-11 14:47:24,276 INFO [regionserver60020-smallCompactions-1478895907565] regionserver.StoreFile$Reader: Loaded Delete Family Bloom (CompoundBloomFilter) metadata for 0537bf31a4c942bbb8caaa0c5cac2148
2016-11-11 14:47:25,679 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry
2016-11-11 14:47:28,680 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry
2016-11-11 14:47:31,682 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry
2016-11-11 14:47:34,683 DEBUG [regionserver60020] regionserver.HRegionServer: No master found; retry