
HBase RegionServers shut down after a few hours

Explorer

I installed a new HDFS + HBase cluster. But after running for a few hours, all the HBase RegionServers shut down. There isn't any data being written.

Versions:

HBase  1.2.0-cdh5.10.0

Hadoop  2.6.0-cdh5.10.0

zookeeper 3.4.5-cdh5.10

2021-11-09 21:15:51,928 WARN  [ResponseProcessor for block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2626] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2626
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2272)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1075)
2021-11-09 21:15:51,928 WARN  [DataStreamer for file /hbase/WALs/jfhbase03,60020,1636456250380/jfhbase03%2C60020%2C1636456250380.default.1636463451629 block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2626] hdfs.DFSClient: Error Recovery for block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2626 in pipeline DatanodeInfoWithStorage[9.180.152.40:50010,DS-04c3a0da-36df-41b1-98ae-628c357fad41,DISK], DatanodeInfoWithStorage[9.180.152.33:50010,DS-e9e87481-2dc1-46aa-9685-c9e5bd920f9f,DISK], DatanodeInfoWithStorage[9.180.152.39:50010,DS-2cf89d04-34d8-4623-82b2-58daa7ac3e0c,DISK]: bad datanode DatanodeInfoWithStorage[9.180.152.40:50010,DS-04c3a0da-36df-41b1-98ae-628c357fad41,DISK]
2021-11-09 21:16:52,007 WARN  [ResponseProcessor for block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2634] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2634
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2272)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1075)
2021-11-09 21:16:52,008 WARN  [DataStreamer for file /hbase/WALs/jfhbase03,60020,1636456250380/jfhbase03%2C60020%2C1636456250380.default.1636463451629 block BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2634] hdfs.DFSClient: Error recovering pipeline for writing BP-899263853-9.180.152.33-1634099649590:blk_1073742680_2634. Already retried 5 times for the same packet.
2021-11-09 21:17:51,743 INFO  [main-EventThread] replication.ReplicationTrackerZKImpl: /hbase/rs/jfhbase10,60020,1636456250396 znode expired, triggering replicatorRemoved event
2021-11-09 21:18:24,288 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moving jfhbase10,60020,1636456250396's WALs to my queue
2021-11-09 21:20:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=779, evicted=0, evictedPerRun=0.0
2021-11-09 21:25:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=809, evicted=0, evictedPerRun=0.0
2021-11-09 21:30:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=839, evicted=0, evictedPerRun=0.0
2021-11-09 21:35:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=869, evicted=0, evictedPerRun=0.0
2021-11-09 21:40:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=899, evicted=0, evictedPerRun=0.0
2021-11-09 21:45:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=929, evicted=0, evictedPerRun=0.0
2021-11-09 21:50:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=959, evicted=0, evictedPerRun=0.0
2021-11-09 21:55:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=989, evicted=0, evictedPerRun=0.0
2021-11-09 22:00:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=1019, evicted=0, evictedPerRun=0.0
2021-11-09 22:05:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=1049, evicted=0, evictedPerRun=0.0
2021-11-09 22:10:50,426 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=6.64 MB, freeSize=6.32 GB, max=6.33 GB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=1079, evicted=0, evictedPerRun=0.0
2021-11-09 22:10:50,428 INFO  [MobFileCache #0] mob.MobFileCache: MobFileCache Statistics, access: 0, miss: 0, hit: 0, hit ratio: 0%, evicted files: 0
2021-11-09 22:10:51,722 ERROR [sync.3] wal.FSHLog: Error syncing, request close of WAL
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1230)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
2021-11-09 22:10:51,722 WARN  [regionserver/jfhbase03/9.180.152.43:60020.logRoller] wal.FSHLog: Failed sync-before-close but no outstanding appends; closing WAL: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
2021-11-09 22:10:51,722 WARN  [regionserver/jfhbase03/9.180.152.43:60020.logRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1230)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
2021-11-09 22:10:51,722 ERROR [regionserver/jfhbase03/9.180.152.43:60020.logRoller] wal.FSHLog: Failed close of WAL writer hdfs://JFHbaseHDFS/hbase/WALs/jfhbase03,60020,1636456250380/jfhbase03%2C60020%2C1636456250380.default.1636463451629, unflushedEntries=0
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1230)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
2021-11-09 22:10:51,722 FATAL [regionserver/jfhbase03/9.180.152.43:60020.logRoller] regionserver.HRegionServer: ABORTING region server jfhbase03,60020,1636456250380: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://JFHbaseHDFS/hbase/WALs/jfhbase03,60020,1636456250380/jfhbase03%2C60020%2C1636456250380.default.1636463451629, unflushedEntries=0
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:886)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:703)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1230)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
2021-11-09 22:10:51,722 FATAL [regionserver/jfhbase03/9.180.152.43:60020.logRoller] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2021-11-09 22:10:51,742 INFO  [regionserver/jfhbase03/9.180.152.43:60020.logRoller] regionserver.HRegionServer: STOPPED: Failed log close in log roller
2021-11-09 22:10:51,742 INFO  [regionserver/jfhbase03/9.180.152.43:60020.logRoller] regionserver.LogRoller: LogRoller exiting.
2021-11-09 22:10:51,742 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2021-11-09 22:10:51,742 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HRegionServer: Stopping infoServer
2021-11-09 22:10:51,742 INFO  [SplitLogWorker-jfhbase03:60020] regionserver.SplitLogWorker: SplitLogWorker interrupted. Exiting. 
2021-11-09 22:10:51,743 INFO  [SplitLogWorker-jfhbase03:60020] regionserver.SplitLogWorker: SplitLogWorker jfhbase03,60020,1636456250380 exiting
2021-11-09 22:10:51,743 INFO  [regionserver/jfhbase03/9.180.152.43:60020] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2021-11-09 22:10:51,843 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HeapMemoryManager: Stoping HeapMemoryTuner chore.
2021-11-09 22:10:51,843 INFO  [regionserver/jfhbase03/9.180.152.43:60020] flush.RegionServerFlushTableProcedureManager: Stopping region server flush procedure manager abruptly.
2021-11-09 22:10:51,843 INFO  [MemStoreFlusher.0] regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting
2021-11-09 22:10:51,843 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: MemStoreFlusher.1 exiting
2021-11-09 22:10:51,843 INFO  [regionserver/jfhbase03/9.180.152.43:60020] snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly.
2021-11-09 22:10:51,844 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HRegionServer: aborting server jfhbase03,60020,1636456250380
2021-11-09 22:10:51,844 INFO  [regionserver/jfhbase03/9.180.152.43:60020] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x17d046470ac0004
2021-11-09 22:10:51,847 INFO  [regionserver/jfhbase03/9.180.152.43:60020] zookeeper.ZooKeeper: Session: 0x17d046470ac0004 closed
2021-11-09 22:10:51,847 INFO  [regionserver/jfhbase03/9.180.152.43:60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2021-11-09 22:10:51,848 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HRegionServer: stopping server jfhbase03,60020,1636456250380; all regions closed.
2021-11-09 22:10:51,848 WARN  [regionserver/jfhbase03/9.180.152.43:60020] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.nio.channels.ClosedChannelException
        at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1940)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
        at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
        at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:80)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.writeWALTrailer(ProtobufLogWriter.java:157)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:130)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.shutdown(FSHLog.java:1068)
        at org.apache.hadoop.hbase.wal.DefaultWALProvider.shutdown(DefaultWALProvider.java:114)
        at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:221)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1321)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1070)
        at java.lang.Thread.run(Thread.java:748)
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.Leases: regionserver/jfhbase03/9.180.152.43:60020 closing leases
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.Leases: regionserver/jfhbase03/9.180.152.43:60020 closed leases
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] hbase.ChoreService: Chore service for: jfhbase03,60020,1636456250380 had [[ScheduledChore: Name: jfhbase03,60020,1636456250380-MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region jfhbase03,60020,1636456250380 Period: 120000 Unit: MILLISECONDS]] on shutdown
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.CompactSplitThread: Waiting for Split Thread to finish...
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.CompactSplitThread: Waiting for Merge Thread to finish...
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.CompactSplitThread: Waiting for Large Compaction Thread to finish...
2021-11-09 22:10:51,859 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.CompactSplitThread: Waiting for Small Compaction Thread to finish...
2021-11-09 22:10:51,866 INFO  [regionserver/jfhbase03/9.180.152.43:60020] ipc.RpcServer: Stopping server on 60020
2021-11-09 22:10:51,866 INFO  [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: stopping
2021-11-09 22:10:51,866 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2021-11-09 22:10:51,866 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2021-11-09 22:10:51,874 INFO  [regionserver/jfhbase03/9.180.152.43:60020] zookeeper.ZooKeeper: Session: 0x37d04643ce40000 closed
2021-11-09 22:10:51,874 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2021-11-09 22:10:51,874 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HRegionServer: stopping server jfhbase03,60020,1636456250380; zookeeper connection closed.
2021-11-09 22:10:51,874 INFO  [regionserver/jfhbase03/9.180.152.43:60020] regionserver.HRegionServer: regionserver/jfhbase03/9.180.152.43:60020 exiting
2021-11-09 22:10:51,874 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2676)
2021-11-09 22:10:51,875 INFO  [Thread-6] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@ea9b7c6
2021-11-09 22:10:51,875 INFO  [Thread-6] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2021-11-09 22:10:51,885 INFO  [Thread-6] regionserver.ShutdownHook: Shutdown hook finished.
Wed Nov 10 15:35:54 CST 2021 Starting regionserver on jfhbase03
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256263
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8

Can anyone help, please?

6 REPLIES

Super Collaborator

Hello @xgxshtc 

 

Thanks for using the Cloudera Community. Based on the post, the RegionServer is reporting an EOFException and a "bad DataNode" while writing its WALs under "/hbase/WALs". It appears the HDFS blocks are having issues. To fix this, your team can sideline the contents of "/hbase/WALs" (specifically "/hbase/WALs/jfhbase03,60020,1636456250380") and restart the concerned RegionServer. If all RegionServers are impacted, sideline each of the RegionServer directories under "/hbase/WALs". Note that the WALs hold the edits not yet persisted to disk, so sidelining the WAL directories (one WAL directory per RegionServer) may incur data loss.

 

Additionally, review the HDFS fsck report for the WAL directory "/hbase/WALs" and fix any corrupt or missing blocks. After confirming the fsck report for "/hbase/WALs" is healthy, your team can restart the HBase RegionServers.
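For reference, the fsck check suggested above might look like this (the path is the WAL directory from the logs; run it as the HDFS superuser or the HBase service user):

```shell
# Report block health for the HBase WAL directory only.
# -files/-blocks/-locations print per-file block details;
# the report ends with HEALTHY or CORRUPT for the path.
hdfs fsck /hbase/WALs -files -blocks -locations

# If corrupt blocks are reported, list just the affected files:
hdfs fsck /hbase/WALs -list-corruptfileblocks
```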

 

Regards, Smarak

Explorer

Thanks for replying. I have some questions. First, I tried reinstalling the whole cluster (deleting all contents), but the same thing happened again. Also, what do you mean by "sideline the contents"? When I restart the RegionServer, the logs show this:

 

2021-11-15 10:02:17,440 WARN  [ResponseProcessor for block BP-1448520106-9.180.152.33-1636618223804:blk_1073742122_1743] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1448520106-9.180.152.33-1636618223804:blk_1073742122_1743
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2272)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1075)
2021-11-15 10:02:17,440 WARN  [DataStreamer for file /hbase/WALs/jfhbase03,60020,1636941117142/jfhbase03%2C60020%2C1636941117142..meta.1636941557288.meta block BP-1448520106-9.180.152.33-1636618223804:blk_1073742122_1743] hdfs.DFSClient: Error Recovery for block BP-1448520106-9.180.152.33-1636618223804:blk_1073742122_1743 in pipeline DatanodeInfoWithStorage[9.180.152.30:50010,DS-c7dc93d4-3561-4673-b49a-5e78b685ddb0,DISK], DatanodeInfoWithStorage[9.180.152.40:50010,DS-e7a5fd42-9133-4d9b-ba12-8c5349e51249,DISK]: bad datanode DatanodeInfoWithStorage[9.180.152.30:50010,DS-c7dc93d4-3561-4673-b49a-5e78b685ddb0,DISK]

 

Super Collaborator

Hello @xgxshtc 

 

Thanks for the update. By "sideline", we meant ensuring the HBase WAL directory "/hbase/WALs" is empty. Let us know if the steps below help:

  • Stop the HBase RegionServers,
  • Sideline the WAL directory contents, i.e. there shouldn't be any directories left under "/hbase/WALs",
  • Restart the HBase RegionServers.
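A minimal sketch of those steps as shell commands. The destination "/hbase/WALs-sidelined" is an arbitrary name chosen here for illustration, not a standard path; stop the RegionServers first (e.g. via Cloudera Manager):

```shell
# 1. Stop all HBase RegionServers first (via Cloudera Manager, or
#    `hbase-daemon.sh stop regionserver` on each node).

# 2. Sideline (move, don't delete) the per-RegionServer WAL directories
#    so /hbase/WALs is left empty. Moving preserves the files in case
#    they are needed later; un-flushed edits may still be lost.
hdfs dfs -mkdir -p /hbase/WALs-sidelined      # hypothetical destination
hdfs dfs -mv '/hbase/WALs/*' /hbase/WALs-sidelined/

# 3. Confirm the directory is empty, then restart the RegionServers.
hdfs dfs -ls /hbase/WALs
```

Quoting the glob leaves the expansion to the HDFS shell rather than the local one, so it matches paths in HDFS.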

 

Additionally, you reinstalled the cluster and yet observed the same issue again. This likely indicates the HDFS state may be unhealthy. Any chance you can review the HDFS fsck report for the HBase WAL directory [1] to confirm whether the blocks associated with the HBase WAL files are healthy?

 

Regards, Smarak

 

[1] https://hadoop.apache.org/docs/r1.2.1/commands_manual.html#fsck

 

Explorer

When I stop the HBase RegionServers, /hbase/WALs is not empty. So I deleted the contents of /hbase/WALs and ran "hdfs fsck /hbase/WALs", which shows the following; I think it means the filesystem is healthy. Then I started the HBase RegionServers, but the problem is not solved.

 

Connecting to namenode via http://jfhbase01:50070/fsck?ugi=mqq&path=%2Fhbase%2FWALs
FSCK started by xgx (auth:SIMPLE) from /9.180.152.33 for path /hbase/WALs at Mon Nov 15 16:31:30 CST 2021
Status: HEALTHY
 Total size:    0 B (Total open files size: 1069 B)
 Total dirs:    10
 Total files:   0
 Total symlinks:                0 (Files currently being written: 10)
 Total blocks (validated):      0 (Total open file blocks (not validated): 10)
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    3
 Average block replication:     0.0
 Corrupt blocks:                0
 Missing replicas:              0
 Number of data-nodes:          10
 Number of racks:               1
FSCK ended at Mon Nov 15 16:31:30 CST 2021 in 0 milliseconds


The filesystem under path '/hbase/WALs' is HEALTHY

 

Super Collaborator

Hello @xgxshtc 

 

Thanks for the update. If you try to access the WAL file (the one for which the DFSOutputStream reports Premature EOF) via "hdfs dfs -cat" or "hdfs dfs -head", does the command run successfully?

 

The fsck command can also be run with "-openforwrite" to show the details of the 10 files currently open for write.
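Putting those two checks together might look like this. The WAL file path is the one from the earlier log, used purely as an example; also, on Hadoop 2.6 the `hdfs dfs -head` subcommand may not be available, so piping `-cat` through `head` is used instead:

```shell
# Show the files still open for write under the WAL directory,
# with their block and location details.
hdfs fsck /hbase/WALs -openforwrite -files -blocks -locations

# Try reading the start of one WAL file that reported Premature EOF;
# a failure here typically points at the last, unfinalized block.
hdfs dfs -cat '/hbase/WALs/jfhbase03,60020,1636456250380/jfhbase03%2C60020%2C1636456250380.default.1636463451629' | head -c 1024
```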

 

Regards, Smarak

Explorer

This is what I see (via "hdfs dfs -cat /hbase/WALs/jfhbase04,60020,1637139292945/jfhbase04%2C60020%2C1637139292945.default.1637142895380"):

21/11/17 18:44:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PWAL"ProtobufLogWriter*5org.apache.hadoop.hbase.regionserver.wal.WALCellCodecD
 281aa8ab0c063ad4f5cddab6b97f1296test:tb_deposit_log    * METAFAMILYHBASE::REGION_EVENT}-V5test:tb_deposit_log 281aa8ab0c063ad4f5cddab6b97f1296 *

infoinfo2
        jfhbase04test:tb_deposit_log,,1637143163577.281aa8ab0c063ad4f5cddab6b97f1296.

It looks like data is being written. Will the WAL contents be released and written to disk later?