Created 04-09-2018 06:24 PM
We hit a strange error in our HBase cluster that we have not been able to debug yet. It may have caused a lot of latency spikes in our system. We see the following logs in our region server, after which its IPC.QueueSize increased:
18:51:12,756 WARN [DataStreamer for file /apps/hbase/data/WALs/hbase-dn-131,16020,1517942847659/hbase-dn-131%2C16020%2C1517942847659.default.1523243925996 block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431] hdfs.DFSClient: Error Recovery for block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431 in pipeline DatanodeInfoWithStorage[101.341.1.246:50010,DS-d0254124-f206-4315-b337-8867eeb53375,DISK], DatanodeInfoWithStorage[101.321.11.107:50010,DS-4af2f0fd-69d2-4d7b-bb33-beb380c8fdcc,DISK], DatanodeInfoWithStorage[101.321.73.234:50010,DS-42df372f-dc80-46f1-b3eb-71794a509749,DISK]: bad datanode DatanodeInfoWithStorage[101.321.73.234:50010,DS-42df372f-dc80-46f1-b3eb-71794a509749,DISK]
18:51:12,327 INFO [DataStreamer for file /apps/hbase/data/WALs/hbase-dn-131,16020,1517942847659/hbase-dn-131%2C16020%2C1517942847659.default.1523243925996 block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431] hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as 101.331.85.8:50010
    at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1369)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1193)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
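For context, when the DFSClient sees a bad datanode in a write pipeline it tries to replace that node and re-establish the pipeline, and this behavior is controlled by the dfs.client.block.write.replace-datanode-on-failure.* client settings. A minimal sketch of how those settings can be set programmatically (the values shown are illustrative assumptions, not a recommendation for this cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PipelineRecoveryConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Whether the client replaces a failed datanode in the write pipeline.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
        // DEFAULT replaces a datanode only under certain conditions (e.g. the
        // pipeline shrinking too far); ALWAYS and NEVER are the alternatives.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // best-effort: keep writing with the remaining nodes even if the
        // replacement itself fails (available in newer 2.x releases).
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", false);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}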
At the same time, we see the following error in the HDFS logs of the DN where it is trying to create this replica:
18:51:13,492 INFO impl.FsDatasetImpl (FsDatasetImpl.java:recoverRbw(1322)) - Recover RBW replica BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431
18:51:13,492 INFO datanode.DataNode (DataXceiver.java:writeBlock(837)) - opWriteBlock BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-1872413417-101.331.253.88-1458393583173:1136880676
18:51:13,492 ERROR datanode.DataNode (DataXceiver.java:run(278)) - hbase-dn-370:50010:DataXceiver error processing WRITE_BLOCK operation src: /101.321.11.107:45232 dst: /101.321.202.112:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-1872413417-101.331.253.88-1458393583173:1136880676
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:766)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverRbw(FsDatasetImpl.java:1324)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:677)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
    at java.lang.Thread.run(Thread.java:745)
This replica creation failed on all the DNs. We have not been able to find why exactly this happened or what can be done about it.
Looking at the Hadoop code, I see the exception is raised from the following method:
/**
 * Get the meta info of a block stored in volumeMap. Block is looked up
 * without matching the generation stamp.
 * @param bpid block pool Id
 * @param blkid block Id
 * @return the meta replica information; null if block was not found
 * @throws ReplicaNotFoundException if no entry is in the map or
 *         there is a generation stamp mismatch
 */
private ReplicaInfo getReplicaInfo(String bpid, long blkid)
    throws ReplicaNotFoundException {
  ReplicaInfo info = volumeMap.get(bpid, blkid);
  if (info == null) {
    throw new ReplicaNotFoundException(
        ReplicaNotFoundException.NON_EXISTENT_REPLICA + bpid + ":" + blkid);
  }
  return info;
}
This means HDFS was not able to find the block in the volume map. I don't understand who updates this volume map, why it was not updated in this case, or whether this went down a wrong code path.
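My reading of the same source (a simplification and an assumption on my part, not a definitive description of FsDatasetImpl) is that a DN registers a replica in the volume map when the replica is created locally, e.g. at the start of a pipeline write, and recoverRbw() can only recover a replica that is already in that map. A toy sketch of that lookup contract, with hypothetical names, reproducing the failure mode in the log above:

import java.util.HashMap;
import java.util.Map;

// Toy model of the DN's in-memory replica map, keyed by (blockPoolId, blockId).
// Names and behavior are simplified assumptions for illustration only.
public class ToyReplicaMap {
    private final Map<String, String> map = new HashMap<>();

    private String key(String bpid, long blkid) {
        return bpid + ":" + blkid;
    }

    // Called when a replica is created locally (analogous to createRbw).
    public void add(String bpid, long blkid, String state) {
        map.put(key(bpid, blkid), state);
    }

    // Analogous to getReplicaInfo: recovery fails if the entry was never
    // added, or was removed (e.g. after a DN restart or block deletion).
    public String getReplicaInfo(String bpid, long blkid) {
        String info = map.get(key(bpid, blkid));
        if (info == null) {
            throw new IllegalStateException(
                "Cannot append to a non-existent replica " + key(bpid, blkid));
        }
        return info;
    }

    public static void main(String[] args) {
        ToyReplicaMap volumeMap = new ToyReplicaMap();
        // Pipeline recovery asks a DN that never created this replica locally:
        volumeMap.getReplicaInfo("BP-1872413417-101.331.253.88-1458393583173",
                                 1136880676L); // throws, mirroring the DN log
    }
}

If that reading is right, the exception during recoverRbw() would suggest the recovery request reached a DN that never held an in-memory entry for that block, or whose entry had since been dropped; I would welcome a correction from anyone who knows the actual flow.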
Created 04-10-2018 03:12 PM
@Saurabh Saurabh, have you checked the following HCC article to see if it applies to your case?