
Solution for "slow readprocessor" warnings

Super Collaborator

Our cluster suddenly got very slow, for no evident reason.

A couple of our nodes seem to be the bottleneck, but it isn't clear why. They are not swapping, and THP (transparent huge pages) is set up correctly. We also just rebalanced everything, so data skew should not be an issue.

When I look at the logs for slow queries, I see a lot of this.

I'm not totally sure how to interpret what I see.

"Slow ReadProcessor read fields took 165073ms"

Does this mean that HDFS took almost 3 minutes to read a block of data?

If so, does anyone know why this might be happening?

2016-03-09 12:30:22,269 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000303
2016-03-09 12:30:40,362 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000368
2016-03-09 12:30:40,362 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000367
2016-03-09 12:30:50,415 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000394
2016-03-09 12:30:50,415 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000393
2016-03-09 12:30:54,437 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000405
2016-03-09 12:31:00,467 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000418
2016-03-09 12:46:13,052 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e30_1457112266933_0219_01_000881
2016-03-09 12:49:32,997 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 76652ms (threshold=30000ms); ack: seqno: 1587 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 551441, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 12:50:56,503 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 73728ms (threshold=30000ms); ack: seqno: 1593 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 650743, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 12:53:07,327 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 76607ms (threshold=30000ms); ack: seqno: 1601 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 518370, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 12:54:59,399 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 112072ms (threshold=30000ms); ack: seqno: 1603 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1093642, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 12:57:44,473 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 165073ms (threshold=30000ms); ack: seqno: 1605 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 682741, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:01:42,644 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 64163ms (threshold=30000ms); ack: seqno: 2005 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 455684, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:17:21,104 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 57172ms (threshold=30000ms); ack: seqno: 3545 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 500979, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:18:29,962 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 68857ms (threshold=30000ms); ack: seqno: 3547 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 469307, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:21:00,774 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 43896ms (threshold=30000ms); ack: seqno: 3569 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 430916, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:22:55,910 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 88569ms (threshold=30000ms); ack: seqno: 3581 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 537758, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 13:27:24,885 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 108859ms (threshold=30000ms); ack: seqno: 3617 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 640831, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 14:35:57,241 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 512355ms (threshold=30000ms); ack: seqno: 3619 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 537965, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
2016-03-09 14:57:26,177 WARN [ResponseProcessor for block BP-1450007529-AA.BB.CC.QQ-1415122377411:blk_1097162958_23469355] org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 568933ms (threshold=30000ms); ack: seqno: 3621 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 637309, targets: [AA.BB.CC.DD:50010, AA.BB.CC.XX:50010, AA.BB.CC.ZZ:50010]
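To make these warnings easier to eyeball, I summarized them with a rough Python sketch like the one below ("app.log" is just a placeholder for the log file I copied these lines from); it pulls out the timestamp, the reported "read fields" time, and the pipeline of target DataNodes, so you can see whether the same nodes show up every time:

import re

# Summarize the DFSClient "Slow ReadProcessor" warnings pasted above.
# "app.log" is a placeholder for the log file the warnings were copied from.
pattern = re.compile(
    r"(\S+ \S+) WARN .*Slow ReadProcessor read fields took (\d+)ms "
    r"\(threshold=(\d+)ms\).*targets: \[([^\]]+)\]"
)

with open("app.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            ts, took_ms, threshold_ms, targets = m.groups()
            print(f"{ts}  took {int(took_ms) / 1000:.1f}s  "
                  f"(threshold {threshold_ms}ms)  pipeline: [{targets}]")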
4 REPLIES

Super Guru

Right now I can think of three possible scenarios that might create this issue.

1. Check the DataNode logs corresponding to that block ID. If there is an issue with the disks, you will probably see messages like the ones below. Even if you can't find these messages, you should still take a look at disk I/O (a small log-scanning sketch follows this list).

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:200ms (threshold=15ms)

OR

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 986ms (threshold=300ms)

2. Check the DataNode's GC pattern as well, because a long GC pause can also cause this issue.

3. If the DataNode doesn't show any disk or GC issues, then network latency between the client and the DataNode could be the culprit; you can use iperf or another network tool to check latency.
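For the first two checks, something like this rough Python sketch can help with triage. The log paths and the 5-second GC cutoff are only assumptions (adjust them for your install), and it assumes the JVM writes the usual "[Times: user=... sys=... real=N secs]" lines to its GC log. For the third check, plain iperf between the client host and the DataNode is enough.

import re

# Rough triage for checks 1 and 2 above.
# Both paths and the 5-second pause cutoff are placeholders -- adjust for your cluster.
DATANODE_LOG = "/var/log/hadoop/hdfs/hadoop-hdfs-datanode.log"
GC_LOG = "/var/log/hadoop/hdfs/datanode-gc.log"

slow_write = re.compile(r"Slow BlockReceiver write (?:data to disk|packet to mirror).*?(\d+)ms")
gc_pause = re.compile(r"real=([0-9.]+) secs")  # the [Times: ...] line in JDK 7/8 GC logs

with open(DATANODE_LOG) as f:
    writes = [int(m.group(1)) for m in (slow_write.search(line) for line in f) if m]
print(f"slow BlockReceiver writes: {len(writes)}, worst: {max(writes, default=0)}ms")

with open(GC_LOG) as f:
    pauses = [float(m.group(1)) for m in (gc_pause.search(line) for line in f) if m]
print(f"GC pauses over 5s: {len([p for p in pauses if p > 5.0])}, "
      f"worst: {max(pauses, default=0.0):.1f}s")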

Super Collaborator

Thanks @Jitendra Yadav,

All interesting suggestions, though I haven't been able to chase any of them down to a root cause or a solution so far.

Explorer

I'm seeing the same behaviour appear halfway through a larger Pig job.

The HDFS DataNodes seem to be in some sort of thrashing mode, and I get an endless stream of log messages like:

WARN datanode.DataNode (BlockReceiver.java:receivePacket(525)) - Slow BlockReceiver write packet to mirror took 5104ms (threshold=300ms)

The MapReduce jobs no longer seem to make any progress, and leaving the job running overnight does not help.
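In case it helps anyone comparing notes, here is a quick Python sketch I would use to bucket these warnings by minute and see whether the thrashing is continuous or bursty. "datanode.log" is a placeholder, and it assumes each log line starts with the usual "YYYY-MM-DD HH:MM:SS,mmm" timestamp prefix:

import re
from collections import Counter

# Count "Slow BlockReceiver write packet to mirror" warnings per minute.
# "datanode.log" is a placeholder for the real DataNode log file.
pattern = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*Slow BlockReceiver write packet to mirror took (\d+)ms"
)

per_minute = Counter()
with open("datanode.log") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            per_minute[m.group(1)] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)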

Contributor

How about checking the last contact for those DataNodes?
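If it helps, one way to see that is hdfs dfsadmin -report, which prints a "Last contact" line for each DataNode. A minimal Python sketch (assuming the hdfs client is on the PATH and configured for the cluster):

import subprocess

# Print each DataNode's name, hostname, and last-contact time from the dfsadmin report.
# Assumes the "hdfs" client is on the PATH and pointed at the right cluster.
report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                        capture_output=True, text=True, check=True).stdout

for line in report.splitlines():
    if line.startswith(("Name:", "Hostname:", "Last contact:")):
        print(line)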