Getting " IOException: Failed to replace a bad datanode" while executing MapReduce Jobs
Labels: Apache Hadoop
Created 04-12-2016 12:48 PM
I'm trying to execute a MapReduce streaming job on a 10-node Hadoop cluster (HDP 2.2). There are 5 datanodes in the cluster. When the reduce phase reaches almost 100% completion, I get the following error in the client logs:
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[x.x.x.x:50010], original=[x.x.x.x:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration
The datanode on which the jobs were executing had the following in its logs:
INFO datanode.DataNode (BlockReceiver.java:run(1222)) - PacketResponder: BP-203711345-10.254.65.246-1444744156994:blk_1077645089_3914844, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
2016-04-10 08:12:14,477 WARN datanode.DataNode (BlockReceiver.java:run(1256)) - IOException in BlockReceiver.run(): java.io.IOException: Connection reset by peer
2016-04-10 08:13:22,431 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(816)) - Exception for BP-203711345-x.x.x.x-1444744156994:blk_1077645082_3914836 java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/XX.XXX.XX.XX:50010 remote=/XX.XXX.XX.XXX:57649]
The NameNode logs contained the following warning:
WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(383)) - Failed to place enough replicas, still in need of 1 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
I have tried setting the following parameters in hdfs-site.xml:
dfs.datanode.handler.count = 10
dfs.client.file-block-storage-locations.num-threads = 10
dfs.datanode.socket.write.timeout = 20000
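For completeness, this is roughly how those entries look in my hdfs-site.xml (same values as above):

<!-- sketch of the hdfs-site.xml entries for the parameters listed above -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.num-threads</name>
  <value>10</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
</property>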
But the error still persists. Kindly suggest a solution.
Thanks
Created 04-12-2016 01:48 PM
Are all of your datanodes healthy, and do they have enough available disk space? For some reason, writing a block to one of them fails, and because your replication factor is 2 and replace-datanode-on-failure.policy=DEFAULT, the NameNode will not try another DN and the write fails. So, first make sure your DNs are all right. If they look good, then try to set:
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true
The second one works only in newer versions of Hadoop (HDP 2.2.6 or later). See this and this for details.
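Since these are client-side settings (the error message itself says a client may configure the policy in its configuration), they would go in the hdfs-site.xml used by the client; a sketch:

<!-- sketch: client-side hdfs-site.xml entries for the two properties above -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>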
Created 04-14-2016 10:17 AM
Thanks for the suggestions. Two of the datanodes in the cluster had to be replaced, as they didn't have enough disk space. I have also set the property below in the HDFS configuration, and the jobs started executing fine, even though I still noticed the "Premature EOF" error in the datanode logs.
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
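In hdfs-site.xml form, that final setting would look roughly like this:

<!-- sketch of the single hdfs-site.xml entry that resolved the issue -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>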
