Created 09-04-2018 03:18 PM
I have a simple HBase cluster with 3 nodes. Each node has only 16 GB of memory.
I get the errors below, and then the RegionServer shuts down automatically.
2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is XXXXXX
2018-07-10 00:31:14,468 ERROR [sync.2] wal.FSHLog: Error syncing, request close of WAL
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try
Here is the full log before the shutdown hook was triggered.
2018-07-10 00:31:11,183 INFO [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,af255818-e3ab-b48b-4182-dd292c0156d5,1523328286019.65928cb6d569e0c70ede8815dfbaa169.
2018-07-10 00:31:11,184 INFO [StoreCloserThread-ANX_X1000___INDEX_STAGING_DATA__1_23AC161C-6BFE-486E-83A6-1B6A2E7F32E7,81167a42-bb2b-a385-d70b-f8cd2c82befb,1523990247486.12ad2399cd12b097e416b3c08c836234.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,184 INFO [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,1f936a13-5caf-44ee-1c0d-c7eee6adfa5d,1523186311575.4a1a3fd2b59c50af9e3c7cc6c5028121.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Memstore size is 3733952
2018-07-10 00:31:11,184 INFO [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,1f936a13-5caf-44ee-1c0d-c7eee6adfa5d,1523186311575.4a1a3fd2b59c50af9e3c7cc6c5028121.
2018-07-10 00:31:11,184 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is 833544
2018-07-10 00:31:11,184 INFO [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Closed ANX_X1000___INDEX_STAGING_DATA__1_23AC161C-6BFE-486E-83A6-1B6A2E7F32E7,81167a42-bb2b-a385-d70b-f8cd2c82befb,1523990247486.12ad2399cd12b097e416b3c08c836234.
2018-07-10 00:31:11,185 INFO [StoreCloserThread-ANX_X1234_ORG_WSO2_APIMGT_STATISTICS_RESPONSE,,1514542454206.42c97d2efc0cf0e55b79cae2c50b30be.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,185 INFO [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed ANX_X1234_ORG_WSO2_APIMGT_STATISTICS_RESPONSE,,1514542454206.42c97d2efc0cf0e55b79cae2c50b30be.
2018-07-10 00:31:11,185 INFO [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,ebc9678c-e35a-4096-bb26-c9e1cbcf82fd,1523522965118.07ef074275e9cc613334bb067098af41.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,186 ERROR [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Memstore size is 5080768
2018-07-10 00:31:11,186 INFO [RS_CLOSE_REGION-XXXX:16020-2] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_PROCESSEDSTATISTICS,ebc9678c-e35a-4096-bb26-c9e1cbcf82fd,1523522965118.07ef074275e9cc613334bb067098af41.
2018-07-10 00:31:11,186 INFO [StoreCloserThread-IDX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_REPONSETIME_SUMMARY_PER_DAY,,1525628114343.c9c49d54c6e5a9fed1ec7d66f09133c2.-1] regionserver.HStore: Closed carbon-analytics-index
2018-07-10 00:31:11,186 INFO [StoreCloserThread-ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_OPERATOR_SUMMARY,,1514542511538.c548a01476c09769345b1318f41b1b3b.-1] regionserver.HStore: Closed carbon-analytics-data
2018-07-10 00:31:11,186 INFO [RS_CLOSE_REGION-XXXX:16020-0] regionserver.HRegion: Closed IDX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_REPONSETIME_SUMMARY_PER_DAY,,1525628114343.c9c49d54c6e5a9fed1ec7d66f09133c2.
2018-07-10 00:31:11,187 ERROR [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Memstore size is 488
2018-07-10 00:31:11,187 INFO [RS_CLOSE_REGION-XXXX:16020-1] regionserver.HRegion: Closed ANX_X1234_ORG_XXXX_ANALYTICS_HUB_STREAM_OPERATOR_SUMMARY,,1514542511538.c548a01476c09769345b1318f41b1b3b.
2018-07-10 00:31:11,201 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: stopping server XXXX.hbase02.com,16020,1523886890292; all regions closed.
2018-07-10 00:31:14,449 WARN [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] hdfs.DFSClient: Slow waitForAckedSeqno took 63148ms (threshold=30000ms)
2018-07-10 00:31:14,460 INFO [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Slow sync cost: 63158 ms, current pipeline: [DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]
2018-07-10 00:31:14,468 ERROR [sync.2] wal.FSHLog: Error syncing, request close of WAL
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,535 WARN [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Failed sync-before-close but no outstanding appends; closing WAL: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
2018-07-10 00:31:14,536 WARN [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,536 WARN [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Riding over failed WAL close of hdfs://XXXX/apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531159209078.meta, cause="Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
2018-07-10 00:31:14,542 INFO [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531159209078.meta with entries=0, filesize=83 B; new WAL /apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531162811268.meta
2018-07-10 00:31:14,543 INFO [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] wal.FSHLog: Archiving hdfs://XXXX/apps/hbase/data/WALs/XXXX.hbase02.com,16020,1523886890292/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531159209078.meta to hdfs://XXXX/apps/hbase/data/oldWALs/XXXX.hbase02.com%2C16020%2C1523886890292..meta.1531159209078.meta
2018-07-10 00:31:14,554 INFO [RS_OPEN_META-XXXX:16020-0-MetaLogRoller] regionserver.LogRoller: LogRoller exiting.
2018-07-10 00:31:14,592 WARN [regionserver/XXXX.hbase02.com/172.26.76.78:16020] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
2018-07-10 00:31:14,593 ERROR [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: Shutdown / close of WAL failed: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]], original=[DatanodeInfoWithStorage[172.26.76.78:50010,DS-c8bdc5a8-0cef-4dc8-b652-b71429e2f728,DISK], DatanodeInfoWithStorage[172.26.76.79:50010,DS-d86c5793-44d8-4826-a233-744319ef269d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
2018-07-10 00:31:14,603 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.Leases: regionserver/XXXX.hbase02.com/172.26.76.78:16020 closing leases
2018-07-10 00:31:14,675 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.Leases: regionserver/XXXX.hbase02.com/172.26.76.78:16020 closed leases
2018-07-10 00:31:14,682 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] hbase.ChoreService: Chore service for: XXXX.hbase02.com,16020,1523886890292 had [[ScheduledChore: Name: MovedRegionsCleaner for region XXXX.hbase02.com,16020,1523886890292 Period: 120000 Unit: MILLISECONDS]] on shutdown
2018-07-10 00:31:14,816 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] ipc.RpcServer: Stopping server on 16020
2018-07-10 00:31:14,816 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping
2018-07-10 00:31:14,821 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2018-07-10 00:31:14,821 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2018-07-10 00:31:14,893 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] zookeeper.ZooKeeper: Session: 0x1629a093dc7019b closed
2018-07-10 00:31:14,893 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: stopping server XXXX.hbase02.com,16020,1523886890292; zookeeper connection closed.
2018-07-10 00:31:14,893 INFO [regionserver/XXXX.hbase02.com/172.26.76.78:16020] regionserver.HRegionServer: regionserver/XXXX.hbase02.com/172.26.76.78:16020 exiting
2018-07-10 00:31:14,894 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2018-07-10 00:31:15,670 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2018-07-10 00:31:15,670 INFO [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.
2018-07-10 00:31:16,184 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2018-07-10 00:31:16,458 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2832)
2018-07-10 00:31:16,687 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2018-07-10 00:31:16,819 INFO [pool-4-thread-1] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@324c64cd
2018-07-10 00:31:16,819 INFO [pool-4-thread-1] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2018-07-10 00:31:16,873 INFO [pool-4-thread-1] regionserver.ShutdownHook: Shutdown hook finished.
Please help me find the root cause of this.
Created 09-04-2018 08:07 PM
You have an issue where DataNodes are being marked as dead, which leaves HBase unable to reach the replication level you have configured for its WAL writes. Inspect why the DataNodes are being marked as failed: JVM GC pauses and networking issues are common suspects, and the 63-second "Slow sync cost" warning right before the error points in that direction.
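While you investigate why the DataNodes drop out, the exception itself names the client-side knob involved: dfs.client.block.write.replace-datanode-on-failure.policy. On a 3-node cluster there is often no spare DataNode to swap into a failed pipeline, so some operators relax the replacement behaviour. Below is a minimal hdfs-site.xml sketch, not a definitive fix; it assumes you apply it to the configuration the HBase RegionServers read and restart them, and it only hides the symptom rather than curing the unstable DataNodes.

<!-- Sketch only: relax DFS client pipeline-recovery behaviour for a small cluster. -->
<!-- Assumes this hdfs-site.xml is on the HBase RegionServer classpath.            -->
<configuration>
  <!-- Keep datanode replacement enabled (this is the default). -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>true</value>
  </property>
  <!-- DEFAULT is the stock policy the error message refers to; NEVER would skip
       replacement entirely, at the cost of writing with fewer replicas. -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>DEFAULT</value>
  </property>
  <!-- If no replacement DataNode can be found, keep writing to the remaining
       ones instead of failing the WAL sync (off by default). -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
    <value>true</value>
  </property>
</configuration>

Even with best-effort enabled, the WAL temporarily runs with fewer replicas than intended, so the underlying DataNode instability (GC pauses, slow or failing disks, network drops) still needs to be found and fixed.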