HBase Region Servers Failure

Contributor

My HBase region servers are failing frequently. I googled the problem and tried a few options to fix it. I changed the configuration shown below in hdfs-site.xml through Ambari, but saw no improvement. My disks also have enough free space to store data. For your reference, I have attached my log details (error-log-hbase.txt). Kindly suggest.

dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=DEFAULT

6 REPLIES

Have you checked your HDFS health? You can check it using hdfs fsck.

Contributor

Hi nshelke,

Yes, I have checked. It's healthy and there is no datanode failure, but I found missing replicas.

Super Guru

It would appear from the logs that you only have two datanodes. You don't have any datanodes to replace, therefore this property can't actually do anything. Either stabilize your datanodes, add more datanodes, or reduce the HDFS replication.
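To make the point about the replacement property concrete, here is a minimal sketch of the DEFAULT policy rule as I understand it from the description in hdfs-default.xml. The function name and parameters are hypothetical and are not part of any Hadoop API; this only restates the documented condition and does not model whether a spare healthy datanode is actually available to join the pipeline.

```python
def should_replace_datanode(replication: int,
                            remaining_in_pipeline: int,
                            hflushed_or_appended: bool) -> bool:
    """Hypothetical restatement of the DEFAULT replace-datanode-on-failure policy.

    replication: the file's replication factor (r).
    remaining_in_pipeline: datanodes still alive in the write pipeline (n).
    hflushed_or_appended: whether the block has been hflushed or appended,
        which is assumed to be the case for an HBase WAL.
    """
    # The policy never asks for a replacement below replication 3.
    if replication < 3:
        return False
    # Case 1: at most half of the replicas are left in the pipeline.
    if replication // 2 >= remaining_in_pipeline:
        return True
    # Case 2: fewer nodes than replicas, and the data was hflushed/appended.
    return replication > remaining_in_pipeline and hflushed_or_appended


if __name__ == "__main__":
    # WAL-like write, replication 3, one pipeline node lost.
    print(should_replace_datanode(3, 2, True))
```

Even when this condition says a replacement is wanted, the client can only add a node if the NameNode can offer one that accepts the write, which is why the property alone does not help when the remaining datanodes are not responding.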

Contributor

Hi Josh,

I have 5 datanodes. All are healthy and there is no datanode volume failure. Is there any other way to fix this issue? If you look at the log, you will find these ERROR and FATAL messages repeated many times:

ERROR [RS_CLOSE_REGION-aps-hadoop5:16020-0] regionserver.HRegion: Memstore size is 147136.

FATAL [regionserver/aps-hadoop5/1..1..1..:16020.logRoller] regionserver.HRegionServer: ABORTING region server aps-hadoop5,16020,1493618413009: Failed log close in log roller.

Will this impact anything? Kindly suggest.

Super Guru

Something is happening in your datanodes that is causing HBase to mark them as "bad":

2017-05-03 21:22:38,729 WARN  [DataStreamer for file /apps/hbase/data/WALs/aps-hadoop5,16020,1493618413009/aps-hadoop5%2C16020%2C1493618413009.default.1493846432867 block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908] hdfs.DFSClient: Error Recovery for block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908 in pipeline DatanodeInfoWithStorage[1..1..1..:50010,DS-751946a0-5a6f-4485-ad27-61f061359410,DISK], DatanodeInfoWithStorage[10.64.228.140:50010,DS-8ab76f9c-ee05-4ec0-897a-8718ab89635f,DISK], DatanodeInfoWithStorage[10.64.228.150:50010,DS-57010fb6-92c0-4c3e-8b9e-11233ceb7bfa,DISK]: bad datanode DatanodeInfoWithStorage[1..1..1..:50010,DS-751946a0-5a6f-4485-ad27-61f061359410,DISK]
2017-05-03 21:22:41,744 INFO  [DataStreamer for file /apps/hbase/data/WALs/aps-hadoop5,16020,1493618413009/aps-hadoop5%2C16020%2C1493618413009.default.1493846432867 block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908] hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as 10.64.228.164:50010
	at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1393)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1217)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:904)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:411)
2017-05-03 21:22:41,745 WARN  [DataStreamer for file /apps/hbase/data/WALs/aps-hadoop5,16020,1493618413009/aps-hadoop5%2C16020%2C1493618413009.default.1493846432867 block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908] hdfs.DFSClient: Error Recovery for block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908 in pipeline DatanodeInfoWithStorage[10.64.228.140:50010,DS-8ab76f9c-ee05-4ec0-897a-8718ab89635f,DISK], DatanodeInfoWithStorage[10.64.228.150:50010,DS-57010fb6-92c0-4c3e-8b9e-11233ceb7bfa,DISK], DatanodeInfoWithStorage[10.64.228.164:50010,DS-9ba4f08a-d996-4490-b27d-6c8ca9a67152,DISK]: bad datanode DatanodeInfoWithStorage[10.64.228.164:50010,DS-9ba4f08a-d996-4490-b27d-6c8ca9a67152,DISK]
2017-05-03 21:22:44,779 INFO  [DataStreamer for file /apps/hbase/data/WALs/aps-hadoop5,16020,1493618413009/aps-hadoop5%2C16020%2C1493618413009.default.1493846432867 block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908] hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as 10.64.228.141:50010
	at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1393)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1217)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:904)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:411)

I'd look in those datanode logs and figure out why they failed to respond when HBase was writing data. It seems like HBase gets down to only the datanodes that it can actually talk to (two in the pipeline below, out of your five). In general, your HDFS seems very unstable: at one point it took over 70 seconds to sync data (this should be a sub-second operation):

2017-05-03 21:22:44,782 INFO  [sync.0] wal.FSHLog: Slow sync cost: 72065 ms, current pipeline: [DatanodeInfoWithStorage[10.64.228.140:50010,DS-8ab76f9c-ee05-4ec0-897a-8718ab89635f,DISK], DatanodeInfoWithStorage[10.64.228.150:50010,DS-57010fb6-92c0-4c3e-8b9e-11233ceb7bfa,DISK]]
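To decide which datanode logs to read first, it can help to tally how often each node shows up as a bad datanode or as firstBadLink, and to list the slow sync costs. The following is only a sketch: the regular expressions are written against the message shapes quoted in this thread, the sample lines are shortened copies of them, and the function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Patterns based on the DFSClient / FSHLog messages quoted in this thread.
BAD_DATANODE = re.compile(r"bad datanode DatanodeInfoWithStorage\[([^,\]]+),")
FIRST_BAD_LINK = re.compile(r"firstBadLink as (\S+)")
SLOW_SYNC = re.compile(r"Slow sync cost: (\d+) ms")

# Shortened copies of the log lines above ("..." marks text cut for brevity).
SAMPLE_LINES = [
    "2017-05-03 21:22:38,729 WARN hdfs.DFSClient: Error Recovery for block ... "
    "bad datanode DatanodeInfoWithStorage[1..1..1..:50010,DS-...,DISK]",
    "java.io.IOException: Got error, status message , ack with firstBadLink as 10.64.228.164:50010",
    "2017-05-03 21:22:41,745 WARN hdfs.DFSClient: Error Recovery for block ... "
    "bad datanode DatanodeInfoWithStorage[10.64.228.164:50010,DS-...,DISK]",
    "java.io.IOException: Got error, status message , ack with firstBadLink as 10.64.228.141:50010",
    "2017-05-03 21:22:44,782 INFO [sync.0] wal.FSHLog: Slow sync cost: 72065 ms, current pipeline: ...",
]


def summarize(lines):
    """Return (bad datanode counts, firstBadLink counts, slow sync costs in ms)."""
    bad_datanodes = Counter()
    first_bad_links = Counter()
    slow_syncs_ms = []
    for line in lines:
        match = BAD_DATANODE.search(line)
        if match:
            bad_datanodes[match.group(1)] += 1
        match = FIRST_BAD_LINK.search(line)
        if match:
            first_bad_links[match.group(1)] += 1
        match = SLOW_SYNC.search(line)
        if match:
            slow_syncs_ms.append(int(match.group(1)))
    return bad_datanodes, first_bad_links, slow_syncs_ms


if __name__ == "__main__":
    bad, first_bad, slow = summarize(SAMPLE_LINES)
    print("bad datanodes:", dict(bad))
    print("firstBadLink:", dict(first_bad))
    print("slow syncs (ms):", slow)
```

To run it over a real region server log, pass an open file object to summarize() in place of SAMPLE_LINES. The nodes with the highest counts are the ones whose datanode logs are worth checking first.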

Explorer

Something does not look right about this one DN, which is identified with an unusual IP address: 1..1..1..:50010

2017-05-03 21:22:38,729 WARN  [DataStreamer for file /apps/hbase/data/WALs/aps-hadoop5,16020,1493618413009/aps-hadoop5%2C16020%2C1493618413009.default.1493846432867 block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908] hdfs.DFSClient: Error Recovery for block BP-1810172115-10.64.228.157-1478343078462:blk_1079562185_5838908 in pipeline DatanodeInfoWithStorage[1..1..1..:50010,DS-751946a0-5a6f-4485-ad27-61f061359410,DISK], DatanodeInfoWithStorage[10.64.228.140:50010,DS-8ab76f9c-ee05-4ec0-897a-8718ab89635f,DISK], DatanodeInfoWithStorage[10.64.228.150:50010,DS-57010fb6-92c0-4c3e-8b9e-11233ceb7bfa,DISK]: bad datanode DatanodeInfoWithStorage[1..1..1..:50010,DS-751946a0-5a6f-4485-ad27-61f061359410,DISK]
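A quick way to spot entries like this across a whole log is to check every host:port pair against the standard IPv4 format. This is just a sketch using Python's ipaddress module; the function name is hypothetical. Keep in mind that the odd value may simply be an address that was masked before the log was posted, in which case it tells you nothing about the cluster.

```python
import ipaddress


def is_valid_ipv4_endpoint(endpoint: str) -> bool:
    """Return True if endpoint looks like '<IPv4 address>:<port>'."""
    host, sep, port = endpoint.rpartition(":")
    # Require a colon separator and a numeric port.
    if not sep or not port.isdigit():
        return False
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # Raised for malformed addresses such as '1..1..1..'.
        return False
    return True


if __name__ == "__main__":
    for endpoint in ("1..1..1..:50010", "10.64.228.140:50010"):
        print(endpoint, is_valid_ipv4_endpoint(endpoint))
```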