Created 04-22-2019 02:32 PM
Can the connection timeouts on the MapReduce job on the client be caused by the PendingReplicationMonitor timing out? What can be done to reduce the chance of the connection timeout happening?
The NameNode warning occurred around the same time period as the connection timeout on the client.
WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReplicationMonitor timed out blk_1132731024_59506975
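For reference, the HDFS client's connect timeout and retry budget toward the NameNode RPC port are governed by the ipc.client.connect.* settings in core-site.xml. A minimal sketch for checking their effective values from the client node; the defaults noted in the comments are stock Hadoop values, not tuned recommendations for this cluster:

hdfs getconf -confKey ipc.client.connect.timeout                  # ms per TCP connect attempt (stock default 20000)
hdfs getconf -confKey ipc.client.connect.max.retries.on.timeouts  # connect retries after a timeout (stock default 45)
hdfs getconf -confKey ipc.client.connect.max.retries              # connect retries after a connection refusal (stock default 10)

Raising the retry budget only buys time through transient stalls; it does not address whatever is making port 9000 unreachable.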
Error on the MapReduce job:
2019-04-22 19:22:53,068 FATAL [IPC Server handler 20 on 34731] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1555397408307_487639_m_000000_0 - exited : java.net.ConnectException: Call From ClientServerName/10.202.xxx.xxx to NameNodeServerName:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1324)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1311)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1299)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:315)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:280)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:267)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1630)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:339)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:784)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
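Since the trace fails inside the TCP connect to NameNodeServerName:9000 (not inside an RPC that already reached the NameNode), a quick check from the failing client during one of these windows helps separate a network/port issue from a NameNode-side stall. A rough sketch; the hostnames are the placeholders from the error above:

nc -vz -w 5 NameNodeServerName 9000     # plain TCP connect test against the NameNode RPC port
ping -c 3 NameNodeServerName            # name resolution and basic reachability
hdfs dfs -ls / > /dev/null && echo "HDFS client RPC to the NameNode succeeded"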
Created 04-28-2019 08:15 AM
Are you kicking off any HDFS replication?
Could you provide the CM and CDH versions?
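If replication or re-replication is running, the NameNode side will show a backlog; a rough way to check (commands are illustrative, run as the HDFS superuser):

hdfs dfsadmin -report | grep -i "under replicated"   # cluster-wide under-replicated block count
hdfs dfsadmin -metasave pending-repl.txt             # dumps blocks queued for replication into the NameNode log directory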
Created 05-03-2019 11:29 AM
No CM installed.
CDH: 2.6.0-cdh5.15.1 (CDH 5.15.1)
Failure:
2019-05-03 13:18:50,472 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:etl (auth:SIMPLE) cause:java.net.ConnectException: Call From clientMachine#14/clientMachine#14IP to NameNode:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2019-05-03 13:18:50,473 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From clientMachine#14/clientMachine#14IP to NameNode:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1324)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1311)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1299)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:315)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:280)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:267)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1630)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:339)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:784)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:760)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
... 31 more
DataNode log on the failed client machine#14 around the failure time:
886389:2019-05-03 13:17:36,651 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: clientMachinej#14:50010:DataXceiver error processing WRITE_BLOCK operation src: /clientIP:36531 dst: /datanode#10IP:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1669201578-10.202.33.149-1549049450582:blk_1155667125_83195343 already exists in state FINALIZED and thus cannot be created.
892139:2019-05-03 13:18:22,989 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(clientIP, datanodeUuid=3e35a5fc-4e9b-4548-8631-ab85b5885007, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-1f8081e6-d4ed-437d-8273-1f2a5edd4057;nsid=338758850;c=0):Exception writing BP-1669201578-10.202.33.149-1549049450582:blk_1155669671_83197947 to mirror datanode#10IP:50010
892140:java.io.IOException: Connection reset by peer
894608:java.io.EOFException: Premature EOF: no length prefix available
894613:2019-05-03 13:18:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671740_83200091
894614:java.io.IOException: Premature EOF from inputStream
894626:2019-05-03 13:18:49,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
894627:java.nio.channels.ClosedByInterruptException
894642:java.nio.channels.ClosedByInterruptException
894657:2019-05-03 13:18:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671740_83200091 received exception java.io.IOException: Premature EOF from inputStream
894659:java.io.IOException: Premature EOF from inputStream
894674:java.io.EOFException: Premature EOF: no length prefix available
894679:2019-05-03 13:18:49,227 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
894680:java.io.IOException: Connection reset by peer
894698:java.io.IOException: Connection reset by peer
894716:2019-05-03 13:18:49,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671508_83199858
894717:java.nio.channels.ClosedByInterruptException
894739:2019-05-03 13:18:49,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671508_83199858 received exception java.nio.channels.ClosedByInterruptException
894741:java.nio.channels.ClosedByInterruptException
894765:2019-05-03 13:18:49,369 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671603_83199953
894766:java.io.IOException: Premature EOF from inputStream
894780:2019-05-03 13:18:49,369 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671603_83199953 received exception java.io.IOException: Premature EOF from inputStream
894782:java.io.IOException: Premature EOF from inputStream
894796:2019-05-03 13:18:49,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671632_83199982
894797:java.io.IOException: Premature EOF from inputStream
894810:java.io.EOFException: Premature EOF: no length prefix available
894817:2019-05-03 13:18:49,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671632_83199982 received exception java.io.IOException: Premature EOF from inputStream
894819:java.io.IOException: Premature EOF from inputStream
There were some GC pauses on the NameNode machine #9, but they were fixed by increasing the heap memory.
129,238,586 files and directories, 31,709,865 blocks = 160,948,451 total filesystem object(s). Heap Memory used 33.34 GB of 101.15 GB Heap Memory. Max Heap Memory is 101.15 GB.
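For context on the heap change mentioned above: without CM, the NameNode heap on a CDH package install is normally raised in hadoop-env.sh. A minimal sketch, sized to match the ~100 GB heap reported above (a commonly quoted rule of thumb is roughly 1 GB of heap per million filesystem objects):

# hadoop-env.sh on the NameNode host; -Xms/-Xmx sized to match the heap reported above
export HADOOP_NAMENODE_OPTS="-Xms100g -Xmx100g ${HADOOP_NAMENODE_OPTS}"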
On the machine #9 DataNode around the same time:
2019-05-03 13:18:42,146 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671398_83199748, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2019-05-03 13:18:42,148 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671637_83199987, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2019-05-03 13:18:42,149 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1669201578-10.202.33.149-1549049450582:blk_1155671717_83200068 src: /machine#12:33498 dest: /machine#9:50010
2019-05-03 13:18:42,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1096ms (threshold=300ms)
2019-05-03 13:18:42,521 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1163ms (threshold=300ms)
2019-05-03 13:18:54,592 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 508ms (threshold=300ms)
2019-05-03 13:18:55,892 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 14349ms (threshold=300ms)
2019-05-03 13:18:55,734 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 338ms (threshold=300ms)
2019-05-03 13:18:55,404 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 585ms (threshold=300ms)
2019-05-03 13:18:55,352 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 13225ms (threshold=300ms)
2019-05-03 13:18:55,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 501ms (threshold=300ms)
2019-05-03 13:18:55,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 2067ms (threshold=300ms)
2019-05-03 13:18:56,730 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4850ms
No GCs detected
2019-05-03 13:18:55,005 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1025ms (threshold=300ms)
2019-05-03 13:18:54,957 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 850ms (threshold=300ms)
2019-05-03 13:18:54,796 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 12170ms (threshold=300ms)
2019-05-03 13:18:54,634 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 504ms (threshold=300ms)
2019-05-03 13:18:56,107 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 13561ms (threshold=300ms)
2019-05-03 13:18:57,012 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 1512ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155670553_83198851, type=HAS_DOWNSTREAM_IN_PIPELINE, replyAck=seqno: 1704 reply: 0 reply: 0 downstreamAckTimeNanos: 9451006117
2019-05-03 13:18:57,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 392ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671880_83200232, type=LAST_IN_PIPELINE, downstreams=0:[], replyAck=seqno: 67 reply: 0 downstreamAckTimeNanos: 0
2019-05-03 13:18:57,068 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 600ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671512_83199862, type=HAS_DOWNSTREAM_IN_PIPELINE, replyAck=seqno: 19 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 11402140530
2019-05-03 13:18:57,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 2563ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671711_83200062, type=LAST_IN_PIPELINE, downstreams=0:[], replyAck=seqno: 153 reply: 0 downstreamAckTimeNanos: 0
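Multi-second "Slow BlockReceiver write packet to mirror" entries combined with a JvmPauseMonitor pause that reports "No GCs detected" usually point at host-level pressure (network, disks, or swapping) rather than the JVM itself. A hedged sketch of checks to run on the affected DataNodes while the slowness is happening; these are generic Linux commands, nothing cluster-specific:

sar -n DEV 1 5           # NIC throughput and errors during the slow window
iostat -x 1 5            # disk service time and utilization on the DataNode data disks
vmstat 1 5               # watch the si/so columns for swapping during the pause
dmesg -T | tail -n 50    # recent NIC resets, I/O errors, or OOM-killer activity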