Failed to replace a bad datanode

Explorer

Hi folks,

 

We have a 16-node cluster with 3 Flume VMs handling ingestion. All nodes are in good condition, but we're getting the error listed below in each of the Flume logs. The only cause I could find for this is a one-node cluster with replication set to 3, which doesn't apply here. Any ideas? Thanks for the help.

2015-07-17 07:25:09,432 WARN org.apache.flume.sink.hdfs.BucketWriter: Closing file: hdfs://nameservice1:8020/db/live/wifi_info/year=2015/month=07/day=10/_FlumeData.1436584835196.tmp failed. Will retry again in 180 seconds.

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.100.55.65:50010, 10.100.55.62:50010], original=[10.100.55.65:50010, 10.100.55.62:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

2015-07-17 07:25:13,143 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false


Re: Failed to replace a bad datanode

Master Guru
What version of CDH are you using?

Take a look at these two blog posts to understand this policy error better: http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/ and http://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/

The real reason the client could not replace a DN may be in the log entries preceding this one, or in the NN logs (if DN selection itself returned none, which is a rare condition).

Your write still had 2 live DNs in the pipeline, so you could consider switching the config mentioned in the second blog post (specifically the boolean dfs.client.block.write.replace-datanode-on-failure.best-effort, which is false by default).
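
As a rough sketch, that override would go into the Flume agent's client-side hdfs-site.xml (or the equivalent HDFS client configuration safety valve in Cloudera Manager). The property names are standard HDFS client settings; treat the values as a starting point to verify against your own environment:

<!-- Keep the replacement policy at DEFAULT, but let the writer continue on the
     surviving pipeline instead of failing when no replacement DN can be found. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>

Note that with best-effort enabled the client keeps writing even if the pipeline shrinks below the replication factor, so blocks may stay under-replicated until the NN re-replicates them later.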

Re: Failed to replace a bad datanode

Explorer

Thanks for the reply. I'll check through some of the logs.

Version: 2.3.0-cdh5.1.3, r8e266e052e423af592871e2dfe09d54c03f6a0e8

Re: Failed to replace a bad datanode

Explorer

I do see some errors in the NN log:

2015-07-17 20:40:35,046 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hive-user/hive_2015-07-17_20-38-17_251_4261010774163426996-1/_task_tmp.-mr-10005/_tmp.000000_0 is closed by DFSClient_NONMAPREDUCE_113030927_1
2015-07-17 20:40:38,165 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /db/live/wifi_info/year=2015/month=06/day=30/_FlumeData.1436900682209.tmp
2015-07-17 20:40:38,165 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.isFileClosed from 10.100.55.113:35538 Call#440557 Retry#0: error: java.io.FileNotFoundException: File does not exist: /db/live/wifi_info/year=2015/month=06/day=30/_FlumeData.1436900682209.tmp

Re: Failed to replace a bad datanode

Master Guru
How consistent is the error in the Flume logs? Do you get it constantly, or intermittently? Could you upload your Flume logs somewhere such as pastebin.com and share the link?

Re: Failed to replace a bad datanode

Explorer

Here is the log from one of the three Flume servers. It seems to be getting worse.

http://t10.net/dl/flume-cmf-flume-AGENT-flume01.log

Re: Failed to replace a bad datanode

Explorer

We even set up a new Flume agent today, and after about an hour the errors started appearing on it as well.

 

2015-07-23 19:48:54,451 WARN org.apache.flume.sink.hdfs.BucketWriter: Closing file: hdfs://nameservice1:8020/db/live/mobile_info/year=2015/month=07/day=23/_FlumeData.1437696374166.tmp failed. Will retry again in 180 seconds.

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.100.55.60:50010, 10.100.55.64:50010], original=[10.100.55.60:50010, 10.100.55.64:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

2015-07-23 19:48:55,646 INFO org.apache.flume.channel.file.EventQueueBackingStoreFile: Checkpoint backup completed.

2015-07-23 19:49:03,072 INFO org.apache.flume.channel.file.EventQueueBackingStoreFile: Start checkpoint for /hdfs/01/flume/.flume-mobile_info/file-channel/checkpoint/checkpoint, elements to sync = 214678

Re: Failed to replace a bad datanode

Explorer

Also seeing a lot of these in the Cloudera Manager event logs:

 

DatanodeRegistration(10.100.55.73, datanodeUuid=a997ecee-750b-4d67-824b-70204ddea221, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster44;nsid=332967421;c=0):Failed to transfer BP-1411946530-10.100.55.5-1411456129617:blk_1087070708_13356354 to 10.100.55.72:50010 got
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1670)
at org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2373)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:771)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:143)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:83)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
at java.lang.Thread.run(Thread.java:745)

 

and:

DataNode{data=FSDataset{dirpath='[/hdfs/01/dfs/dn/current, /hdfs/02/dfs/dn/current, /hdfs/03/dfs/dn/current, /hdfs/04/dfs/dn/current, /hdfs/05/dfs/dn/current]'}, localName='node12:50010', datanodeUuid='a997ecee-750b-4d67-824b-70204ddea221', xmitsInProgress=0}:Exception transfering block BP-1411946530-10.100.55.5-1411456129617:blk_1087070708_13356355 to mirror 10.100.55.72:50010: java.io.EOFException: Premature EOF: no length prefix available

Re: Failed to replace a bad datanode

New Contributor

Hi, was this resolved?

Re: Failed to replace a bad datanode

Master Guru
Per the logs at least, the Flume client is mainly having trouble talking to the DNs whose IPs end in .219 and .220. I'd check the ifconfig output on those two hosts and on the Flume agent to make sure, first of all, that the errors or frame counters aren't rising on their network interfaces; rising counters would indicate an ongoing network issue, which would be the cause behind this and the area for further investigation.
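
For example, a quick check along these lines (a minimal sketch; "eth0" is an assumed interface name, substitute whatever those hosts actually use):

# Run on the two DN hosts and on each Flume agent, repeating a few minutes
# apart to see whether the errors/dropped/frame counters are growing.
ifconfig eth0 | grep -E 'errors|dropped|frame'
# Or, on hosts with iproute2:
ip -s link show eth0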