Support Questions
Find answers, ask questions, and share your expertise

HBase Region Servers shutdown after a while

HBase Region Servers shutdown after a while

Explorer

I have a simple HBase cluster with 3 nodes. Each node has only 16GB memory.

I get below errors and then the Regionserver shutdown automatically.

2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is XXXXXX


2018-07-10 00:31:14,468 ERROR [sync.2] wal.FSHLog: Error syncing, request close of WAL
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

Here is the log in full before shutdown hook triggered.

2018-07-10 00:31:11,183 INFO  [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,af255818-e3ab-b48b-4182-dd292c0156d5,152332828601
9.65928cb6d569e0c70ede8815dfbaa169.
2018-07-10 00:31:11,184 INFO  [StoreCloserThread-ANX_X1000___INDEX_STAGING_DATA__1_23AC161C-6BFE-486E-83A6-1B6A2E7F32E7,81167a42-bb2b-a385-d70b-f8cd2c82befb,1523990247486.12ad2399cd12b097e416b3c08c836234.
-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,184 INFO  [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,1f936a13-5caf-44ee-1c0d-c7eee6adfa5d,1523186311575.4a1a3fd2b59c50af9e3c7cc6c5028121.-1] re
gionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Memstore size is 3733952
2018-07-10 00:31:11,184 INFO  [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,1f936a13-5caf-44ee-1c0d-c7eee6adfa5d,152318631157
5.4a1a3fd2b59c50af9e3c7cc6c5028121.
2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is 833544
2018-07-10 00:31:11,184 INFO  [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Closed ANX_X1000___INDEX_STAGING_DATA__1_23AC161C-6BFE-486E-83A6-1B6A2E7F32E7,81167a42-bb2b-a385-d70b-f8cd2c82befb,152399
0247486.12ad2399cd12b097e416b3c08c836234.
2018-07-10 00:31:11,185 INFO  [StoreCloserThread-ANX_X1234_ORG_WSO2_APIMGT_STATISTICS_RESPONSE,,1514542454206.42c97d2efc0cf0e55b79cae2c50b30be.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,185 INFO  [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed ANX_X1234_ORG_WSO2_APIMGT_STATISTICS_RESPONSE,,1514542454206.42c97d2efc0cf0e55b79cae2c50b30be.
2018-07-10 00:31:11,185 INFO  [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,ebc9678c-e35a-4096-bb26-c9e1cbcf82fd,1523522965118.07ef074275e9cc613334bb067098af41.-1] re
gionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,186 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is 5080768
2018-07-10 00:31:11,186 INFO  [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,ebc9678c-e35a-4096-bb26-c9e1cbcf82fd,152352296511
8.07ef074275e9cc613334bb067098af41.
2018-07-10 00:31:11,186 INFO  [StoreCloserThread-IDX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_REPONSETIME_SUMMARY_PER_DAY,,1525628114343.c9c49d54c6e5a9fed1ec7d66f09133c2.-1] regionserver.HStore: Closed ca
rbon-analytics-index
2018-07-10 00:31:11,186 INFO  [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_OPERATOR_SUMMARY,,1514542511538.c548a01476c09769345b1318f41b1b3b.-1] regionserver.HStore: Closed carbon-analyt
ics-data
2018-07-10 00:31:11,186 INFO  [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed IDX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_REPONSETIME_SUMMARY_PER_DAY,,1525628114343.c9c49d54c6e5a9fed1ec7d66f0
9133c2.
2018-07-10 00:31:11,187 ERROR [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Memstore size is 488
2018-07-10 00:31:11,187 INFO  [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_OPERATOR_SUMMARY,,1514542511538.c548a01476c09769345b1318f41b1b3b.
2018-07-10 00:31:11,201 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: stopping server XXXX.hbase02.com,16020,1523886890292; all regions closed.
2018-07-10 00:31:14,449 WARN  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] hdfs.DFSClient: Slow waitForAckedSeqno took 63148ms (threshold=30000ms)
2018-07-10 00:31:14,460 INFO  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Slow sync cost: 63158 ms, current pipeline: [DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b7142
9e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]
2018-07-10 00:31:14,468 ERROR [sync.2] wal.FSHLog: Error syncing, request close of WAL
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8
-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b6
52-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure thi
s via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,535 WARN  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Failed sync-before-close but no outstanding appends; closing WAL: java.io.IOException: Failed to replace a bad datanode 
on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithSt
orage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.
76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-fail
ure.policy' in its configuration.
2018-07-10 00:31:14,536 WARN  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8
-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b6
52-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure thi
s via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,536 WARN  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Riding over failed WAL close of hdfs://XXXX/apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2
C16020%2C1523886890292..meta.1531159209078.meta, cause="Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithSto
rage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.7
6.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFA
ULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
2018-07-10 00:31:14,542 INFO  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.153
1159209078.meta with entries=0, filesize=83 B; new WAL /apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531162811268.meta
2018-07-10 00:31:14,543 INFO  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Archiving hdfs://XXXX/apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292
..meta.1531159209078.meta to hdfs://XXXX/apps/hbase/data/oldWALs/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531159209078.meta
2018-07-10 00:31:14,554 INFO  [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] regionserver.LogRoller: LogRoller exiting.
2018-07-10 00:31:14,592 WARN  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8
-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b6
52-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure thi
s via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,593 ERROR [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: Shutdown / close of WAL failed: java.io.IOException: Failed to replace a bad datanode on the ex
isting pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172
.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:500
10,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.polic
y' in its configuration.
2018-07-10 00:31:14,603 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.Leases: regionserver/XXXX.hbase02.com/172.26.76.78:16020 closing leases
2018-07-10 00:31:14,675 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.Leases: regionserver/XXXX.hbase02.com/172.26.76.78:16020 closed leases
2018-07-10 00:31:14,682 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] hbase.ChoreService: Chore service for: XXXX.hbase02.com,16020,1523886890292 had [[ScheduledChore: Name: MovedRegionsCleaner
 for region XXXX.hbase02.com,16020,1523886890292 Period: 120000 Unit: MILLISECONDS]] on shutdown
2018-07-10 00:31:14,816 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] ipc.RpcServer: Stopping server on 16020
2018-07-10 00:31:14,816 INFO  [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping
2018-07-10 00:31:14,821 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2018-07-10 00:31:14,821 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2018-07-10 00:31:14,893 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] zookeeper.ZooKeeper: Session: 0x1629a093dc7019b closed
2018-07-10 00:31:14,893 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: stopping server XXXX.hbase02.com,16020,1523886890292; zookeeper connection closed.
2018-07-10 00:31:14,893 INFO  [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: regionserver/XXXX.hbase02.com/172.26.76.78:16020 exiting
2018-07-10 00:31:14,894 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2018-07-10 00:31:15,670 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2018-07-10 00:31:15,670 INFO  [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.
2018-07-10 00:31:16,184 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2018-07-10 00:31:16,458 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
	at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
	at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2832)
2018-07-10 00:31:16,687 INFO  [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2018-07-10 00:31:16,819 INFO  [pool-4-thread-1] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@324c64cd
2018-07-10 00:31:16,819 INFO  [pool-4-thread-1] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2018-07-10 00:31:16,873 INFO  [pool-4-thread-1] regionserver.ShutdownHook: Shutdown hook finished.


Please help me to find out the root cause for this.

1 REPLY 1

Re: HBase Region Servers shutdown after a while

You have issues where DataNodes are being marked as dead which causes HBase to be unable to reach the required level of replication that you have configured. Inspect why the DataNodes are being marked as failures: JVM gc pauses or networking issues are common suspects.