<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Getting "IOException: Failed to replace a bad datanode" while executing MapReduce Jobs in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166447#M24995</link>
    <description>&lt;P&gt;I'm trying to execute a MapReduce streaming job on a 10-node Hadoop cluster (HDP 2.2). There are 5 DataNodes in the cluster. When the reduce phase reaches almost 100% completion, I get the error below in the client logs:&lt;/P&gt;&lt;PRE&gt;Error: java.io.IOException: Failed to replace a bad
datanode on the existing pipeline due to no more good datanodes being available
to try. (Nodes: current=[x.x.x.x:50010], original=[x.x.x.x:50010]).
The current failed datanode replacement policy is DEFAULT, and a client may
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
in its configuration&lt;/PRE&gt;&lt;P&gt;The DataNode on which the jobs were executing contained the logs below:&lt;/P&gt;&lt;PRE&gt; INFO datanode.DataNode (BlockReceiver.java:run(1222)) - PacketResponder:
BP-203711345-10.254.65.246-1444744156994:blk_1077645089_3914844,
type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available              
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)

java.io.IOException: Premature EOF from inputStream              
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)

2016-04-10 08:12:14,477 WARN  datanode.DataNode
(BlockReceiver.java:run(1256)) - IOException in BlockReceiver.run(): 

java.io.IOException: Connection reset by peer

2016-04-10 08:13:22,431 INFO  datanode.DataNode
(BlockReceiver.java:receiveBlock(816)) - Exception for
BP-203711345-x.x.x.x -1444744156994:blk_1077645082_3914836

java.net.SocketTimeoutException: 60000 millis timeout while
waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/XX.XXX.XX.XX:50010 remote=/XX.XXX.XX.XXX:57649]

&lt;/PRE&gt;&lt;P&gt;The NameNode logs contained the below warning:&lt;/P&gt;&lt;PRE&gt; WARN blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(383)) - Failed to place enough
replicas, still in need of 1 to reach 2 (unavailableStorages=[DISK],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more
information, please enable DEBUG log level on
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy&lt;/PRE&gt;&lt;P&gt;I have tried setting the parameters below in hdfs-site.xml:&lt;/P&gt;&lt;PRE&gt;dfs.datanode.handler.count = 10
dfs.client.file-block-storage-locations.num-threads = 10
dfs.datanode.socket.write.timeout = 20000
&lt;/PRE&gt;&lt;P&gt;But the error still persists. Kindly suggest a solution.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Tue, 12 Apr 2016 19:48:57 GMT</pubDate>
    <dc:creator>phoncy_joseph</dc:creator>
    <dc:date>2016-04-12T19:48:57Z</dc:date>
    <item>
      <title>Getting "IOException: Failed to replace a bad datanode" while executing MapReduce Jobs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166447#M24995</link>
      <description>&lt;P&gt;I'm trying to execute a MapReduce streaming job on a 10-node Hadoop cluster (HDP 2.2). There are 5 DataNodes in the cluster. When the reduce phase reaches almost 100% completion, I get the error below in the client logs:&lt;/P&gt;&lt;PRE&gt;Error: java.io.IOException: Failed to replace a bad
datanode on the existing pipeline due to no more good datanodes being available
to try. (Nodes: current=[x.x.x.x:50010], original=[x.x.x.x:50010]).
The current failed datanode replacement policy is DEFAULT, and a client may
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
in its configuration&lt;/PRE&gt;&lt;P&gt;The DataNode on which the jobs were executing contained the logs below:&lt;/P&gt;&lt;PRE&gt; INFO datanode.DataNode (BlockReceiver.java:run(1222)) - PacketResponder:
BP-203711345-10.254.65.246-1444744156994:blk_1077645089_3914844,
type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available              
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)

java.io.IOException: Premature EOF from inputStream              
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)

2016-04-10 08:12:14,477 WARN  datanode.DataNode
(BlockReceiver.java:run(1256)) - IOException in BlockReceiver.run(): 

java.io.IOException: Connection reset by peer

2016-04-10 08:13:22,431 INFO  datanode.DataNode
(BlockReceiver.java:receiveBlock(816)) - Exception for
BP-203711345-x.x.x.x -1444744156994:blk_1077645082_3914836

java.net.SocketTimeoutException: 60000 millis timeout while
waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/XX.XXX.XX.XX:50010 remote=/XX.XXX.XX.XXX:57649]

&lt;/PRE&gt;&lt;P&gt;The NameNode logs contained the below warning:&lt;/P&gt;&lt;PRE&gt; WARN blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(383)) - Failed to place enough
replicas, still in need of 1 to reach 2 (unavailableStorages=[DISK],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more
information, please enable DEBUG log level on
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy&lt;/PRE&gt;&lt;P&gt;I have tried setting the parameters below in hdfs-site.xml:&lt;/P&gt;&lt;PRE&gt;dfs.datanode.handler.count = 10
dfs.client.file-block-storage-locations.num-threads = 10
dfs.datanode.socket.write.timeout = 20000
&lt;/PRE&gt;&lt;P&gt;But the error still persists. Kindly suggest a solution.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 12 Apr 2016 19:48:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166447#M24995</guid>
      <dc:creator>phoncy_joseph</dc:creator>
      <dc:date>2016-04-12T19:48:57Z</dc:date>
    </item>
    <item>
      <title>Re: Getting "IOException: Failed to replace a bad datanode" while executing MapReduce Jobs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166448#M24996</link>
      <description>&lt;P&gt;Are all of your DataNodes healthy, and do they have enough available disk space? For some reason, writing a block to one of them fails, and because your replication factor is 2 and replace-datanode-on-failure.policy=DEFAULT, the NameNode will not try another DataNode and the write fails. First, make sure your DataNodes are all right. If they look good, then try setting&lt;/P&gt;&lt;PRE&gt;dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true&lt;/PRE&gt;&lt;P&gt;The second setting works only in newer versions of Hadoop (HDP 2.2.6 or later). See &lt;A href="https://community.hortonworks.com/articles/16144/write-or-append-failures-in-very-small-clusters-un.html"&gt;this&lt;/A&gt; and &lt;A href="http://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/"&gt;this&lt;/A&gt; for details.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Apr 2016 20:48:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166448#M24996</guid>
      <dc:creator>pminovic</dc:creator>
      <dc:date>2016-04-12T20:48:04Z</dc:date>
    </item>
    <item>
      <title>Re: Getting "IOException: Failed to replace a bad datanode" while executing MapReduce Jobs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166449#M24997</link>
      <description>&lt;P&gt;Thanks for the suggestions. Two of the DataNodes in the cluster had to be replaced, as they didn't have enough disk space. I have also set the property below in the HDFS configuration, and the jobs started executing fine, even though I still noticed the "Premature EOF" error in the DataNode logs.&lt;/P&gt;&lt;PRE&gt;dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
&lt;/PRE&gt;</description>
      <pubDate>Thu, 14 Apr 2016 17:17:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Getting-quot-IOException-Failed-to-replace-a-bad-datanode/m-p/166449#M24997</guid>
      <dc:creator>phoncy_joseph</dc:creator>
      <dc:date>2016-04-14T17:17:37Z</dc:date>
    </item>
  </channel>
</rss>