Created 04-09-2018 06:24 PM
We hit a strange error in our HBase cluster that we have not been able to debug yet. It may have caused a lot of latency spikes in our system. We see the following logs in our region server, after which its IPC.QueueSize increased:
18:51:12,756 WARN [DataStreamer for file /apps/hbase/data/WALs/hbase-dn-131,16020,1517942847659/hbase-dn-131%2C16020%2C1517942847659.default.1523243925996 block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431] hdfs.DFSClient: Error Recovery for block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431 in pipeline DatanodeInfoWithStorage[101.341.1.246:50010,DS-d0254124-f206-4315-b337-8867eeb53375,DISK], DatanodeInfoWithStorage[101.321.11.107:50010,DS-4af2f0fd-69d2-4d7b-bb33-beb380c8fdcc,DISK], DatanodeInfoWithStorage[101.321.73.234:50010,DS-42df372f-dc80-46f1-b3eb-71794a509749,DISK]: bad datanode DatanodeInfoWithStorage[101.321.73.234:50010,DS-42df372f-dc80-46f1-b3eb-71794a509749,DISK]
18:51:12,327 INFO [DataStreamer for file /apps/hbase/data/WALs/hbase-dn-131,16020,1517942847659/hbase-dn-131%2C16020%2C1517942847659.default.1523243925996 block BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431] hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as 101.331.85.8:50010
    at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1369)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1193)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
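For context, when the DFSClient sees a bad datanode in a write pipeline it tries to replace that node and re-establish the pipeline, and this behavior is controlled by the dfs.client.block.write.replace-datanode-on-failure.* client settings. A minimal sketch of how those settings can be set programmatically (the values shown are illustrative assumptions, not a recommendation for this cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PipelineRecoveryConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Whether the client replaces a failed datanode in the write pipeline.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
        // DEFAULT replaces a datanode only under certain conditions (e.g. the
        // pipeline shrinking too far); ALWAYS and NEVER are the alternatives.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // best-effort: keep writing with the remaining nodes even if the
        // replacement itself fails (available in newer 2.x releases).
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", false);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}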
At the same time, we see the following error in the HDFS logs of the DN where it is trying to create this replica:
18:51:13,492 INFO impl.FsDatasetImpl (FsDatasetImpl.java:recoverRbw(1322)) - Recover RBW replica BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431
18:51:13,492 INFO datanode.DataNode (DataXceiver.java:writeBlock(837)) - opWriteBlock BP-1872413417-101.331.253.88-1458393583173:blk_1136880676_63140431 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-1872413417-101.331.253.88-1458393583173:1136880676
18:51:13,492 ERROR datanode.DataNode (DataXceiver.java:run(278)) - hbase-dn-370:50010:DataXceiver error processing WRITE_BLOCK operation src: /101.321.11.107:45232 dst: /101.321.202.112:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-1872413417-101.331.253.88-1458393583173:1136880676
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:766)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverRbw(FsDatasetImpl.java:1324)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:677)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
    at java.lang.Thread.run(Thread.java:745)
This replica creation failed on all the DNs. We have not been able to find why exactly this happened or what can be done about it.
Looking at the Hadoop code, I see the exception is raised from the following method:
/**
 * Get the meta info of a block stored in volumeMap. Block is looked up
 * without matching the generation stamp.
 * @param bpid block pool Id
 * @param blkid block Id
 * @return the meta replica information; null if block was not found
 * @throws ReplicaNotFoundException if no entry is in the map or
 *         there is a generation stamp mismatch
 */
private ReplicaInfo getReplicaInfo(String bpid, long blkid)
    throws ReplicaNotFoundException {
  ReplicaInfo info = volumeMap.get(bpid, blkid);
  if (info == null) {
    throw new ReplicaNotFoundException(
        ReplicaNotFoundException.NON_EXISTENT_REPLICA + bpid + ":" + blkid);
  }
  return info;
}
This means HDFS was not able to find the block in the volume map. I don't understand who updates this volume map, why it was not updated in this case, or whether this went down a wrong code path.
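My reading of the same source (a simplification and an assumption on my part, not a definitive description of FsDatasetImpl) is that a DN registers a replica in the volume map when the replica is created locally, e.g. at the start of a pipeline write, and recoverRbw() can only recover a replica that is already in that map. A toy sketch of that lookup contract, with hypothetical names, reproducing the failure mode in the log above:

import java.util.HashMap;
import java.util.Map;

// Toy model of the DN's in-memory replica map, keyed by (blockPoolId, blockId).
// Names and behavior are simplified assumptions for illustration only.
public class ToyReplicaMap {
    private final Map<String, String> map = new HashMap<>();

    private String key(String bpid, long blkid) {
        return bpid + ":" + blkid;
    }

    // Called when a replica is created locally (analogous to createRbw).
    public void add(String bpid, long blkid, String state) {
        map.put(key(bpid, blkid), state);
    }

    // Analogous to getReplicaInfo: recovery fails if the entry was never
    // added, or was removed (e.g. after a DN restart or block deletion).
    public String getReplicaInfo(String bpid, long blkid) {
        String info = map.get(key(bpid, blkid));
        if (info == null) {
            throw new IllegalStateException(
                "Cannot append to a non-existent replica " + key(bpid, blkid));
        }
        return info;
    }

    public static void main(String[] args) {
        ToyReplicaMap volumeMap = new ToyReplicaMap();
        // Pipeline recovery asks a DN that never created this replica locally:
        volumeMap.getReplicaInfo("BP-1872413417-101.331.253.88-1458393583173",
                                 1136880676L); // throws, mirroring the DN log
    }
}

If that reading is right, the exception during recoverRbw() would suggest the recovery request reached a DN that never held an in-memory entry for that block, or whose entry had since been dropped; I would welcome a correction from anyone who knows the actual flow.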
Created 04-10-2018 03:12 PM
@Saurabh Saurabh, have you checked the following HCC article to see if it applies to your case?