Support Questions

HBase RegionServer can't start

Expert Contributor

On HDP 2.6.3 I have 4 RegionServers for HBase, and one of them keeps stopping itself. Whenever I restart it from Ambari, it shuts down again within a few seconds.

2018-05-18 17:17:13,465 INFO  [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadooptest3.datalonga.com/10.251.55.183:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 17:17:13,466 INFO  [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.251.55.188:46788, server: hadooptest3.datalonga.com/10.251.55.183:2181
2018-05-18 17:17:13,466 WARN  [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Session 0x1636df6a4170002 for server hadooptest3.datalonga.com/10.251.55.183:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-05-18 17:17:13,509 INFO  [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadooptest1.datalonga.com/10.251.55.181:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 17:17:13,509 INFO  [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.251.55.188:54280, server: hadooptest1.datalonga.com/10.251.55.181:2181
2018-05-18 17:17:13,510 WARN  [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Session 0x26348dea9b50739 for server hadooptest1.datalonga.com/10.251.55.181:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

The logs repeat like this for hundreds of MB.

6 REPLIES

Expert Contributor

Looks like it is a connection issue between ZooKeeper and the RegionServer. Can you provide the RegionServer logs and the ZooKeeper logs?
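
In case it helps, on an HDP node the logs are usually in the default locations below, and a quick ZooKeeper health check can be done with the stat four-letter-word command (paths assumed, hostname taken from your post; adjust to your installation, and nc must be available):

ls /var/log/hbase/hbase-hbase-regionserver-*.log   # RegionServer log (default HDP location, assumed)
ls /var/log/zookeeper/                             # ZooKeeper server logs (default HDP location, assumed)
echo stat | nc hadooptest1.datalonga.com 2181      # quick health/connection check against one quorum member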

Expert Contributor

Thanks @schhabra

The region server logs are below:

2018-05-23 07:42:13,313 WARN  [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,313 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
2018-05-23 07:42:13,313 WARN  [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.ZKUtil: hconnection-0x42873e880x0, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, baseZNode=/hbase Unable to get data of znode /hbase/meta-region-server
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,314 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.ZooKeeperWatcher: hconnection-0x42873e880x0, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,315 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] write.KillServerOnFailurePolicy: Could not update the index table, killing server region because couldn't write to an index table
org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:168)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:132)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:235)
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:215)
	at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1625)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:913)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:927)
	at org.apache.phoenix.execute.DelegateHTable.batch(DelegateHTable.java:94)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:164)
	... 5 more
2018-05-23 07:42:13,315 FATAL [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: ABORTING region server hadooptest8.datalonga.com,16020,1526882638915: Could not update the index table, killing server region because couldn't write to an index table
org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:168)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:132)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
2018-05-23 07:42:13,315 FATAL [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
2018-05-23 07:42:13,327 INFO  [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: Dump of metrics as JSON on abort: 
2018-05-23 07:42:13,330 INFO  [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
2018-05-23 07:42:13,331 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] coprocessor.UngroupedAggregateRegionObserver: IOException during rebuilding: org.apache.hadoop.hbase.DoNotRetryIOException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
	at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:77)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times, 
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:235)

There is no zookeeper log.

Master Mentor

@Erkan ŞİRİN

Please ensure Atlas, ZooKeeper, and HBase are up and running, and that the table ATLAS_ENTITY_AUDIT_EVENTS exists.

Check the entries in ZooKeeper; see my entries below:

 ./bin/zkCli.sh
Connecting to localhost:2181
Welcome to ZooKeeper!
2018-05-23 16:45:58,797 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
......
.......
[zk: localhost:2181(CONNECTED) 0] ls /
[registry, cluster, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, templeton-hadoop, hiveserver2, controller_epoch, druid, rmstore, ambari-metrics-cluster, consumers, config]
[zk: localhost:2181(CONNECTED) 1] ls /hbase-unsecure/table
[ATLAS_ENTITY_AUDIT_EVENTS, hbase:meta, hbase:namespace, atlas_titan, hbase:acl]
[zk: localhost:2181(CONNECTED) 2]

From your HBase shell

hbase(main):001:0> list
TABLE
ATLAS_ENTITY_AUDIT_EVENTS
atlas_titan
2 row(s) in 18.5760 seconds
=> ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan"]
hbase(main):002:0>
  • Stop Atlas via Ambari.
  • In the HBase shell, disable the table with this command:
disable 'atlas_titan'
  • In the HBase shell, drop the table with this command:
drop 'atlas_titan'
  • Start Atlas via Ambari.

The above steps can be repeated for the 'ATLAS_ENTITY_AUDIT_EVENTS' table if there is a requirement to wipe out the audit data as well; a sketch of that follows.
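
For example, a sketch of the same sequence for the audit table in the HBase shell (only if losing the audit history is really acceptable):

disable 'ATLAS_ENTITY_AUDIT_EVENTS'
drop 'ATLAS_ENTITY_AUDIT_EVENTS'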

These steps should reset Atlas and start it as if it were a fresh installation.
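
As a quick verification once Atlas is back up (assuming it recreates its tables on startup), running list in the HBase shell again should show the tables reappear:

hbase(main):001:0> list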

Hope that helps

Expert Contributor

Thank you very much @Geoffrey Shelton Okot for your time. I have followed your suggestion step by step, except for Atlas. I don't have Atlas; even so, I have dropped the 'atlas_titan' table in HBase. I will monitor the RegionServers' behaviour from now on and will report the results here.

Master Mentor

@Erkan ŞİRİN

So has the RegionServer started successfully?

Expert Contributor

@Geoffrey Shelton Okot

Yes, it did. But it stopped again on Saturday morning. Some of the logs:

- org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x55af174a closed
- ERROR [phoenix-update-statistics-3] stats.StatisticsScanner: Failed to update statistics table!
  org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x10e9f278 closed
- ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
  java.lang.RuntimeException: HRegionServer Aborted
- java.io.IOException: Connection reset by peer
- org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid