Created 04-28-2016 10:45 AM
I ran a terasort and it wouldn't complete, so we tried a large put instead and found this error on our DFS client:
16/04/28 16:25:27 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 52148ms (threshold=30000ms); ack: seqno: 25357 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 61584486863 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[10.50.45.148:50010,DS-d4a0215d-8171-4a8b-a3a1-a6a7748b3f23,DISK], DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK], DatanodeInfoWithStorage[10.50.45.143:50010,DS-987423d0-15f5-454c-9034-37a65933e743,DISK]]
16/04/28 16:28:52 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 60247ms (threshold=30000ms); ack: seqno: -2 reply: SUCCESS reply: ERROR downstreamAckTimeNanos: 0 flag: 0 flag: 1, targets: [DatanodeInfoWithStorage[10.50.45.148:50010,DS-d4a0215d-8171-4a8b-a3a1-a6a7748b3f23,DISK], DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK], DatanodeInfoWithStorage[10.50.45.143:50010,DS-987423d0-15f5-454c-9034-37a65933e743,DISK]]
16/04/28 16:28:52 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1466039745-10.50.45.131-1461703637937:blk_1073771085_31037
java.io.IOException: Bad response ERROR for block BP-1466039745-10.50.45.131-1461703637937:blk_1073771085_31037 from datanode DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK]
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer
Along with a number of broken-pipe errors. On closer investigation, we saw that the datanodes are logging a lot of these messages:
2016-04-28 01:55:48,546 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4284ms (threshold=300ms)
2016-04-28 01:55:48,954 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 406ms (threshold=300ms)
2016-04-28 01:55:51,826 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2872ms (threshold=300ms)
2016-04-28 01:55:52,384 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 557ms (threshold=300ms)
2016-04-28 01:55:54,870 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2486ms (threshold=300ms)
2016-04-28 01:55:59,770 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4900ms (threshold=300ms)
2016-04-28 01:56:01,402 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 1631ms (threshold=300ms)
2016-04-28 01:56:03,451 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2048ms (threshold=300ms)
2016-04-28 01:56:04,550 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 979ms (threshold=300ms)
2016-04-28 01:56:12,072 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 7521ms (threshold=300ms)
It looks like a network problem, but when we checked ifconfig we'd dropped only 30 out of 3,000,000 packets. What else could we check to pin down this issue?
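One quick way to get a feel for how often and how badly writes to the mirror are stalling is to summarize the "Slow BlockReceiver" warnings in the datanode log. The sketch below is only an illustration; the sample lines in the here-document are copied from the log excerpt above, and in practice you would pipe in the real log file.

```shell
# Summarize "Slow BlockReceiver write packet to mirror" warnings:
# count them and report the average and worst delay in milliseconds.
summarize_slow_writes() {
  grep 'Slow BlockReceiver write packet to mirror' |
    sed -n 's/.*took \([0-9]*\)ms.*/\1/p' |
    awk '{ n++; s += $1; if ($1 > max) max = $1 }
         END { printf "count=%d avg_ms=%d max_ms=%d\n", n, s/n, max }'
}

# Sample input taken from the log excerpt above; in practice:
#   summarize_slow_writes < /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log
summarize_slow_writes <<'EOF'
2016-04-28 01:55:48,546 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4284ms (threshold=300ms)
2016-04-28 01:55:59,770 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4900ms (threshold=300ms)
2016-04-28 01:56:12,072 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 7521ms (threshold=300ms)
EOF
# prints: count=3 avg_ms=5568 max_ms=7521
```

Running the same summary on each datanode helps tell whether one node (or one mirror link) is the outlier or the whole cluster is slow.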
Off-topic: while investigating, we also saw this in the datanode's *.out file:
max memory size         (kbytes, -m)    unlimited
open files              (-n)            1024
pipe size               (512 bytes, -p) 8
POSIX message queues    (bytes, -q)     819200
real-time priority      (-r)            0
stack size              (kbytes, -s)    8192
cpu time                (seconds, -t)   unlimited
max user processes      (-u)            1547551
virtual memory          (kbytes, -v)    unlimited
file locks              (-x)            unlimited
But we're not seeing any "Too many open files" errors despite the 1,024 open-file limit. PS: Thanks to everyone in this community, you guys are the best.
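Note that the limits printed in the *.out file are the ones in effect when the process started, which may differ from your login shell's. A sketch of how to compare them and where a raise is typically configured (the pid-discovery pattern and limits.conf values below are illustrative assumptions, not confirmed settings for this cluster):

```shell
# Soft open-files limit for the current shell.
ulimit -n

# The live DataNode process may have a different effective limit;
# /proc/<pid>/limits shows what it is actually running with.
# (The pgrep pattern is a guess -- match it to your own process list.)
# dn_pid=$(pgrep -f 'proc_datanode')
# grep 'open files' "/proc/$dn_pid/limits"

# A typical way to raise the limit for the hdfs user is an entry in
# /etc/security/limits.conf (values here are illustrative):
#   hdfs  soft  nofile  32768
#   hdfs  hard  nofile  65536
# The datanode must be restarted from a fresh login to pick this up.
```

Even without "Too many open files" errors today, 1,024 is low for a busy datanode, since each active block transfer holds sockets and file descriptors open.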
Created 06-02-2016 11:43 PM
This looks like a network issue: your datanodes can't keep up with the replication workload. Can you check the ifconfig output for the MTU on all of the datanodes and ensure it is configured consistently?
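A quick way to do that sweep is to pull the MTU from each datanode and eyeball the list for mismatches. The host names, interface name, and ssh access in the commented loop below are assumptions to adapt to your cluster; the parsing helper itself is shown against a captured line so it can be tried standalone.

```shell
# Extract the MTU value from "ip link show <iface>" (or ifconfig) output.
extract_mtu() {
  sed -n 's/.* mtu \([0-9]*\) .*/\1/p' | head -1
}

# Hypothetical sweep -- substitute your own datanode host list and interface:
# for h in dn1 dn2 dn3; do
#   echo "$h $(ssh "$h" ip link show eth0 | extract_mtu)"
# done
# Any host printing a different number (e.g. 1500 vs 9000) is misconfigured.

# Example against a captured "ip link" line:
echo '2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000' | extract_mtu
# prints: 1500
```

A mixed MTU (some nodes on jumbo frames, some not) is a classic cause of slow or stalling replication pipelines, since oversized frames get fragmented or silently dropped between mismatched hosts.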
Below is a short list from a tutorial by @mjohnson on network best practices, which could help your troubleshooting.
"Make certain all members to the HDP cluster have passwordless SSH configured.