Created 05-18-2018 02:29 PM
On HDP 2.6.3 I have 4 RegionServers for HBase, and one of them keeps stopping itself. Whenever I restart it from Ambari, it shuts down again within a few seconds.
2018-05-18 17:17:13,465 INFO [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadooptest3.datalonga.com/10.251.55.183:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 17:17:13,466 INFO [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.251.55.188:46788, server: hadooptest3.datalonga.com/10.251.55.183:2181
2018-05-18 17:17:13,466 WARN [RpcServer.FifoWFPBQ.default.handler=28,queue=1,port=16020-SendThread(hadooptest3.datalonga.com:2181)] zookeeper.ClientCnxn: Session 0x1636df6a4170002 for server hadooptest3.datalonga.com/10.251.55.183:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-05-18 17:17:13,509 INFO [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hadooptest1.datalonga.com/10.251.55.181:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-18 17:17:13,509 INFO [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.251.55.188:54280, server: hadooptest1.datalonga.com/10.251.55.181:2181
2018-05-18 17:17:13,510 WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=16020-SendThread(hadooptest1.datalonga.com:2181)] zookeeper.ClientCnxn: Session 0x26348dea9b50739 for server hadooptest1.datalonga.com/10.251.55.181:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
The logs repeat like the above for hundreds of MBs.
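Basic connectivity from the affected RegionServer host to the quorum nodes can be sanity-checked with something like the following (a sketch, assuming nc is installed and the ZooKeeper four-letter-word commands have not been disabled; hostnames taken from the log above):

# check each quorum member from the RegionServer host
echo ruok | nc hadooptest1.datalonga.com 2181    # expect "imok"
echo stat | nc hadooptest3.datalonga.com 2181    # shows mode, connected clients, latency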
Created 05-22-2018 07:01 PM
Looks like it is a connection issue between ZooKeeper and the region server. Can you provide the region server logs and ZooKeeper logs?
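On a default HDP install the logs are usually under /var/log/hbase and /var/log/zookeeper; exact file names vary per host and the directories may differ if the log dirs were customized in Ambari. Something like:

# typical HDP log locations (adjust if your install uses custom log dirs)
ls -l /var/log/hbase/ /var/log/zookeeper/
tail -n 500 /var/log/hbase/hbase-hbase-regionserver-*.log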
Created 05-23-2018 06:38 AM
Thanks @schhabra
The region server logs are below:
2018-05-23 07:42:13,313 WARN [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,313 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
2018-05-23 07:42:13,313 WARN [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.ZKUtil: hconnection-0x42873e880x0, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, baseZNode=/hbase Unable to get data of znode /hbase/meta-region-server
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,314 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] zookeeper.ZooKeeperWatcher: hconnection-0x42873e880x0, quorum=hadooptest1.datalonga.com:2181,hadooptest2.datalonga.com:2181,hadooptest3.datalonga.com:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
2018-05-23 07:42:13,315 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] write.KillServerOnFailurePolicy: Could not update the index table, killing server region because couldn't write to an index table
org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:168)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:132)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:235)
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:215)
	at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1625)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:913)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:927)
	at org.apache.phoenix.execute.DelegateHTable.batch(DelegateHTable.java:94)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:164)
	... 5 more
2018-05-23 07:42:13,315 FATAL [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: ABORTING region server hadooptest8.datalonga.com,16020,1526882638915: Could not update the index table, killing server region because couldn't write to an index table
org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:168)
	at org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter$1.call(ParallelWriterIndexCommitter.java:132)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
2018-05-23 07:42:13,315 FATAL [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
2018-05-23 07:42:13,327 INFO [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] regionserver.HRegionServer: Dump of metrics as JSON on abort:
2018-05-23 07:42:13,330 INFO [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
2018-05-23 07:42:13,331 ERROR [RpcServer.FifoWFPBQ.default.handler=0,queue=0,port=16020] coprocessor.UngroupedAggregateRegionObserver: IOException during rebuilding: org.apache.hadoop.hbase.DoNotRetryIOException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
	at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:77)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 4 actions: Table 'CITY_I' was not found, got: ATLAS_ENTITY_AUDIT_EVENTS.: 4 times,
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:235)
There is no zookeeper log.
Created 05-23-2018 03:23 PM
Please ensure Atlas, ZooKeeper, and HBase are up and running, and that the table ATLAS_ENTITY_AUDIT_EVENTS exists.
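Also note that the FATAL above is the Phoenix index writer's KillServerOnFailurePolicy aborting the RegionServer because the index table 'CITY_I' could not be found. A quick way to confirm whether that table still exists, from the HBase shell (a sketch; the table name is taken from your log):

exists 'CITY_I'
list 'CITY.*'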
Check the entries in ZooKeeper; see my entries below:
./bin/zkCli.sh
Connecting to localhost:2181
Welcome to ZooKeeper!
2018-05-23 16:45:58,797 - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1019] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
......
.......
[zk: localhost:2181(CONNECTED) 0] ls /
[registry, cluster, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, templeton-hadoop, hiveserver2, controller_epoch, druid, rmstore, ambari-metrics-cluster, consumers, config]
[zk: localhost:2181(CONNECTED) 1] ls /hbase-unsecure/table
[ATLAS_ENTITY_AUDIT_EVENTS, hbase:meta, hbase:namespace, atlas_titan, hbase:acl]
[zk: localhost:2181(CONNECTED) 2]
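Since your earlier errors showed ConnectionLoss for /hbase/meta-region-server, it is also worth checking that znode and the list of live RegionServers from zkCli (on my unsecured cluster the base znode is /hbase-unsecure; yours appears to be /hbase, per zookeeper.znode.parent), for example:

[zk: localhost:2181(CONNECTED) 2] get /hbase-unsecure/meta-region-server
[zk: localhost:2181(CONNECTED) 3] ls /hbase-unsecure/rs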
From your HBase shell:
hbase(main):001:0> list
TABLE
ATLAS_ENTITY_AUDIT_EVENTS
atlas_titan
2 row(s) in 18.5760 seconds

=> ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan"]
hbase(main):002:0>
disable 'atlas_titan'
drop 'atlas_titan'
The above steps can be repeated for the 'ATLAS_ENTITY_AUDIT_EVENTS' table if there is a requirement to wipe out the audit data as well; see the commands below.
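That is, only if losing the audit history is acceptable:

disable 'ATLAS_ENTITY_AUDIT_EVENTS'
drop 'ATLAS_ENTITY_AUDIT_EVENTS'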
The above steps should reset Atlas and start it as if it were a fresh installation.
Hope that helps
Created 05-24-2018 11:01 AM
Thank you very much @Geoffrey Shelton Okot for your time. I have followed your suggestion step by step, except for Atlas. I don't have Atlas; even so, I have dropped the 'atlas_titan' table in HBase. I will monitor the RegionServers' behaviour from now on and post the results here.
Created 05-24-2018 11:20 AM
So has the region server started successfully?
Created 05-28-2018 08:53 AM
Yes, it did, but it stopped again on Saturday morning. Some of the logs:
- org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x55af174a closed
- ERROR [phoenix-update-statistics-3] stats.StatisticsScanner: Failed to update statistics table! org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x10e9f278 closed
- ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting java.lang.RuntimeException: HRegionServer Aborted
- java.io.IOException: Connection reset by peer
- org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
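Since this failure again shows ZooKeeper connection loss (this time for /hbase/hbaseid), I will also check the session timeout on the HBase side and the client connection limits on the ZooKeeper side. A rough sketch of what I plan to inspect (standard HDP config locations; these are just the properties to look at, not recommended values):

# HBase side: ZooKeeper session timeout used by the RegionServers
grep -A1 'zookeeper.session.timeout' /etc/hbase/conf/hbase-site.xml

# ZooKeeper side: per-client connection limit and session timeout bounds
grep -E 'maxClientCnxns|maxSessionTimeout|tickTime' /etc/zookeeper/conf/zoo.cfg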