Slow BlockReceiver write packet to mirror
Labels: Apache Hadoop, Apache YARN
Created 04-28-2016 10:45 AM
I ran a TeraSort and it wouldn't complete, so we tried doing a large put instead and found this error on our DFSClient:
16/04/28 16:25:27 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 52148ms (threshold=30000ms); ack: seqno: 25357 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 61584486863 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[10.50.45.148:50010,DS-d4a0215d-8171-4a8b-a3a1-a6a7748b3f23,DISK], DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK], DatanodeInfoWithStorage[10.50.45.143:50010,DS-987423d0-15f5-454c-9034-37a65933e743,DISK]]
16/04/28 16:28:52 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 60247ms (threshold=30000ms); ack: seqno: -2 reply: SUCCESS reply: ERROR downstreamAckTimeNanos: 0 flag: 0 flag: 1, targets: [DatanodeInfoWithStorage[10.50.45.148:50010,DS-d4a0215d-8171-4a8b-a3a1-a6a7748b3f23,DISK], DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK], DatanodeInfoWithStorage[10.50.45.143:50010,DS-987423d0-15f5-454c-9034-37a65933e743,DISK]]
16/04/28 16:28:52 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1466039745-10.50.45.131-1461703637937:blk_1073771085_31037
java.io.IOException: Bad response ERROR for block BP-1466039745-10.50.45.131-1461703637937:blk_1073771085_31037 from datanode DatanodeInfoWithStorage[10.50.45.138:50010,DS-294fded8-1dcd-465e-89d6-c3d6fc9fb61f,DISK]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer
Along with a bunch of broken-pipe errors. Upon closer investigation, we saw that the datanodes are logging a lot of these messages:
2016-04-28 01:55:48,546 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4284ms (threshold=300ms)
2016-04-28 01:55:48,954 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 406ms (threshold=300ms)
2016-04-28 01:55:51,826 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2872ms (threshold=300ms)
2016-04-28 01:55:52,384 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 557ms (threshold=300ms)
2016-04-28 01:55:54,870 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2486ms (threshold=300ms)
2016-04-28 01:55:59,770 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 4900ms (threshold=300ms)
2016-04-28 01:56:01,402 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 1631ms (threshold=300ms)
2016-04-28 01:56:03,451 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 2048ms (threshold=300ms)
2016-04-28 01:56:04,550 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 979ms (threshold=300ms)
2016-04-28 01:56:12,072 WARN datanode.DataNode (BlockReceiver.java:receivePacket(562)) - Slow BlockReceiver write packet to mirror took 7521ms (threshold=300ms)
It looks like a packet problem, but when we checked ifconfig we had dropped only 30 out of 3,000,000 packets. What else could we check to pin down this issue?
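For reference, this is roughly the check we ran on each datanode to read the interface drop/error counters; the interface name eth0 and the host list are just examples taken from the log above:
for host in 10.50.45.148 10.50.45.138 10.50.45.143; do
  echo "== $host =="
  # Show the configured MTU plus RX/TX packet, error and drop counters
  # for the data interface on this datanode
  ssh "$host" "ip -s link show eth0"
done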
Off-topic: Upon investigation we saw this in the *.out file of the datanode
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1547551
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
But we're not seeing the "Too many open files" error despite the 1,024 open-file limit. PS: Thanks to everyone in this community, you guys are the best.
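In case it matters, here is a minimal sketch of how we plan to raise that limit anyway; the "hdfs" user name and the values are assumptions, adjust them for your setup:
# Persist a higher open-file limit for the user running the DataNode
# (assumed here to be "hdfs"); the values are only examples
echo 'hdfs soft nofile 32768' | sudo tee -a /etc/security/limits.conf
echo 'hdfs hard nofile 65536' | sudo tee -a /etc/security/limits.conf
# Verify from a fresh login session (forcing a shell in case the hdfs
# account uses nologin), then restart the DataNode to pick it up
su - hdfs -s /bin/bash -c 'ulimit -n'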
Created 06-02-2016 11:43 PM
This looks like a network issue where your datanodes can't keep up with the replication workload. Can you check the ifconfig output for the MTU on all of the datanodes and make sure it is configured consistently?
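A quick sketch for that check; the interface name eth0 and the host list are only examples, substitute your own:
for host in 10.50.45.148 10.50.45.138 10.50.45.143; do
  # Print the configured MTU of the data interface on every datanode;
  # all nodes should report the same value
  echo -n "$host: "
  ssh "$host" "cat /sys/class/net/eth0/mtu"
done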
Below is a short list from a tutorial by @mjohnson on network best practices, which could help with your troubleshooting (two of the checks are sketched after the quote).
"Make certain all members to the HDP cluster have passwordless SSH configured.
- Basic heartbeats (typically 3 per second) and administrative commands generated by the Hadoop cluster are infrequent and transfer only small amounts of data, except in extremely large cluster deployments.
- Keep in mind that NAS disks will require more network utilization than plain old disk drives resident on the data node.
- Make certain both fully qualified host names and host aliases are defined and resolvable by all nodes within the cluster.
- Ensure the network interface is consistently defined for all members of the Hadoop cluster (i.e. MTU settings should be consistent)
- Look into defining the MTU for all interfaces on the cluster to support jumbo frames (typically MTU=9000), but only do this if you are certain that all nodes and switches support this functionality. Inconsistent or undefined MTU configurations can cause serious network problems.
- Disable Transparent Huge Page compaction for all nodes on the cluster.
- Make certain all of the HDP cluster's network connections are monitored for collisions and lost packets, and have the network administration team tune the network as required to address any issues identified by that monitoring."
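To illustrate the jumbo frame and Transparent Huge Page items above, here is a rough sketch; the target host and the sysfs path are assumptions that vary by environment and distribution:
# Jumbo frames: verify a 9000-byte MTU path end to end. 8972 is the
# payload size (9000 minus 28 bytes of IP/ICMP headers); -M do forbids
# fragmentation, so the ping fails if any hop cannot carry the frame
ping -M do -s 8972 -c 3 10.50.45.138

# Transparent Huge Pages: check and disable compaction (defrag); the
# exact sysfs path differs between distributions and kernel versions
cat /sys/kernel/mm/transparent_hugepage/defrag
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag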
