Explorer
Posts: 16
Registered: 10-17-2014

Failed to replace a bad datanode

Hi folks,

 

We have a 16-node cluster with 3 Flume VMs handling ingestion. All nodes are in good condition, but we're getting the error listed below in each of the Flume logs. The only cause I could find for this is having a 1-node cluster with replication set to 3, which doesn't apply here. Any ideas? Thanks for the help.
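
For reference, a quick way to double-check the replication factor and the live DataNode count (just a sketch, assuming the hdfs CLI and client configs are available on a gateway host; the exact wording of the dfsadmin report summary varies between releases):

# confirm the configured replication factor (we expect 3)
hdfs getconf -confKey dfs.replication
# confirm all 16 DataNodes show up as live in the report summary
hdfs dfsadmin -report | grep -i 'datanodes'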

 

 

2015-07-17 07:25:09,432 WARN org.apache.flume.sink.hdfs.BucketWriter: Closing file: hdfs://nameservice1:8020/db/live/wifi_info/year=2015/month=07/day=10/_FlumeData.1436584835196.tmp failed. Will retry again in 180 seconds.

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.100.55.65:50010, 10.100.55.62:50010], original=[10.100.55.65:50010, 10.100.55.62:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

2015-07-17 07:25:13,143 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false

Posts: 1,836
Kudos: 415
Solutions: 295
Registered: 07-31-2013

Re: Failed to replace a bad datanode

What version of CDH are you using?

Take a look at these two blog posts to understand this policy error better: http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/ and http://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/

The real reason why the client could not replace a DN may be in the log entries preceding this one, or in the NN logs (if the DN selection itself returned no nodes, which is a rare condition).

Your write still had 2 DNs, so you could consider flipping the config mentioned in the second blog post (specifically the boolean switch dfs.client.block.write.replace-datanode-on-failure.best-effort, which is false by default).
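
If you go that route, here is a sketch of how to check and set it from a client host (assuming the hdfs CLI resolves the same client configuration as your Flume agents, and that the property is supported in your CDH release - getconf will report the key as missing otherwise):

# what the client currently resolves for the replacement policy and the best-effort switch
hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy
hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.best-effort
# to enable it, add the property to the client-side hdfs-site.xml that the Flume agents read
# (or the equivalent HDFS client safety valve in Cloudera Manager):
#   <property>
#     <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
#     <value>true</value>
#   </property>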
Explorer
Posts: 16
Registered: 10-17-2014

Re: Failed to replace a bad datanode

Thanks for the reply. I'll check around in the logs.

 

Version: 2.3.0-cdh5.1.3, r8e266e052e423af592871e2dfe09d54c03f6a0e8

 

 

Explorer
Posts: 16
Registered: 10-17-2014

Re: Failed to replace a bad datanode

I do see some errors in the NN log:

2015-07-17 20:40:35,046 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hive-user/hive_2015-07-17_20-38-17_251_4261010774163426996-1/_task_tmp.-mr-10005/_tmp.000000_0 is closed by DFSClient_NONMAPREDUCE_113030927_1
2015-07-17 20:40:38,165 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /db/live/wifi_info/year=2015/month=06/day=30/_FlumeData.1436900682209.tmp
2015-07-17 20:40:38,165 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.isFileClosed from 10.100.55.113:35538 Call#440557 Retry#0: error: java.io.FileNotFoundException: File does not exist: /db/live/wifi_info/year=2015/month=06/day=30/_FlumeData.1436900682209.tmp

Posts: 1,836
Kudos: 415
Solutions: 295
Registered: 07-31-2013

Re: Failed to replace a bad datanode

How consistent is the error in the Flume logs? Do you get it constantly, or intermittently? Could you upload your Flume logs somewhere such as pastebin.com and share the link?
Explorer
Posts: 16
Registered: 10-17-2014

Re: Failed to replace a bad datanode

Here is the log from one of the 3 Flume servers. It seems to be getting worse.

 

http://t10.net/dl/flume-cmf-flume-AGENT-flume01.log

 

Explorer
Posts: 16
Registered: 10-17-2014

Re: Failed to replace a bad datanode

We even set up a new Flume agent today, and after about an hour the errors started coming in on it as well.

 

2015-07-23 19:48:54,451 WARN org.apache.flume.sink.hdfs.BucketWriter: Closing file: hdfs://nameservice1:8020/db/live/mobile_info/year=2015/month=07/day=23/_FlumeData.1437696374166.tmp failed. Will retry again in 180 seconds.

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.100.55.60:50010, 10.100.55.64:50010], original=[10.100.55.60:50010, 10.100.55.64:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

2015-07-23 19:48:55,646 INFO org.apache.flume.channel.file.EventQueueBackingStoreFile: Checkpoint backup completed.

2015-07-23 19:49:03,072 INFO org.apache.flume.channel.file.EventQueueBackingStoreFile: Start checkpoint for /hdfs/01/flume/.flume-mobile_info/file-channel/checkpoint/checkpoint, elements to sync = 214678

Explorer
Posts: 16
Registered: 10-17-2014

Re: Failed to replace a bad datanode


Also seeing a lot of these in the Cloudera Manager event logs:

 

DatanodeRegistration(10.100.55.73, datanodeUuid=a997ecee-750b-4d67-824b-70204ddea221, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster44;nsid=332967421;c=0):Failed to transfer BP-1411946530-10.100.55.5-1411456129617:blk_1087070708_13356354 to 10.100.55.72:50010 got
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1670)
at org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2373)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:771)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:143)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:83)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
at java.lang.Thread.run(Thread.java:745)

 

and:

DataNode{data=FSDataset{dirpath='[/hdfs/01/dfs/dn/current, /hdfs/02/dfs/dn/current, /hdfs/03/dfs/dn/current, /hdfs/04/dfs/dn/current, /hdfs/05/dfs/dn/current]'}, localName='node12:50010', datanodeUuid='a997ecee-750b-4d67-824b-70204ddea221', xmitsInProgress=0}:Exception transfering block BP-1411946530-10.100.55.5-1411456129617:blk_1087070708_13356355 to mirror 10.100.55.72:50010: java.io.EOFException: Premature EOF: no length prefix available

New Contributor
Posts: 1
Registered: 09-16-2015

Re: Failed to replace a bad datanode

Hi, was this resolved?

Posts: 1,836
Kudos: 415
Solutions: 295
Registered: 07-31-2013

Re: Failed to replace a bad datanode

Per the logs at least, the Flume client is mainly having trouble talking to the DNs whose IPs end in .219 and .220. I'd first check ifconfig on those two hosts and on the Flume agent to make sure the error and frame counters on the network interfaces aren't rising - rising counters would point to an ongoing network issue as the underlying cause (and the area for further investigation).
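
For example (a rough sketch - the interface name eth0 and the presence of ethtool are assumptions, adjust for your hosts):

# on the two DN hosts and on the Flume agent, watch the interface error/drop/frame counters
ifconfig eth0 | grep -E 'errors|dropped|frame'
# ethtool shows the per-NIC hardware counters as well, if it is installed
ethtool -S eth0 | grep -iE 'err|drop'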