Created 04-12-2016 12:48 PM
I'm trying to execute a MapReduce streaming job on a 10-node Hadoop cluster (HDP 2.2). There are 5 DataNodes in the cluster. When the reduce phase reaches almost 100% completion, I get the below error in the client logs:
Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[x.x.x.x:50010], original=[x.x.x.x:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration
The DataNode on which the jobs were executing contained the below logs:
INFO datanode.DataNode (BlockReceiver.java:run(1222)) - PacketResponder: BP-203711345-10.254.65.246-1444744156994:blk_1077645089_3914844, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
2016-04-10 08:12:14,477 WARN datanode.DataNode (BlockReceiver.java:run(1256)) - IOException in BlockReceiver.run(): java.io.IOException: Connection reset by peer
2016-04-10 08:13:22,431 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(816)) - Exception for BP-203711345-x.x.x.x-1444744156994:blk_1077645082_3914836
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/XX.XXX.XX.XX:50010 remote=/XX.XXX.XX.XXX:57649]
The NameNode logs contained the below warning:
WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(383)) - Failed to place enough replicas, still in need of 1 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
I have tried setting the below parameters in hdfs-site.xml:
dfs.datanode.handler.count=10
dfs.client.file-block-storage-locations.num-threads=10
dfs.datanode.socket.write.timeout=20000
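For reference, the entries in my hdfs-site.xml look roughly like this (property names and values exactly as listed above):

<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.num-threads</name>
  <value>10</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
</property>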
But still the error persists. Kindly suggest a solution.
Thanks
Created 04-12-2016 01:48 PM
Are all of your DataNodes healthy, and do they have enough available disk space? For some reason, writing a block to one of them fails, and because your replication factor is 2 and replace-datanode-on-failure.policy=DEFAULT, the NameNode will not try another DataNode and the write fails. So, first make sure your DataNodes are all right. If they look good, then try to set:
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true
The second one works only in newer versions of Hadoop (HDP 2.2.6 or later). See this and this for details.
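For reference, a minimal snippet for these two settings might look like the following (just a sketch, assuming you put them in the client-side hdfs-site.xml that your streaming job picks up):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>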
Created 04-14-2016 10:17 AM
Thanks for the suggestions. Two of the DataNodes in the cluster had to be replaced, as they didn't have enough disk space. I have also set the below in the HDFS configuration, and the jobs started executing fine, even though I still noticed the "Premature EOF" error in the DataNode logs.
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS