Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

mapreduce job ConnectException: Connection timed out to the namenode.

avatar
New Contributor

Can the Connection timed out on the mapreduce job on client caused by the PendingReplicationMonitor timed out?   What can be done to reduce chance the conneciton timed out to happen?

 

Namenode warning happened about the simular time period as the connection timed out as the client.  
WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReplicationMonitor timed out blk_1132731024_59506975


error on the mapreduce job:

2019-04-22 19:22:53,068 FATAL [IPC Server handler 20 on 34731] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1555397408307_487639_m_000000_0 - exited : java.net.ConnectException: Call From ClientServerName/10.202.xxx.xxx to NameNodeServerName:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1324)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1311)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1299)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:315)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:280)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:267)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1630)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:339)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:335)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:784)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)

 

 

2 REPLIES 2

avatar
Champion

Are you kicking of any hdfs replication ? 

Could you provide the cm and cdh version ? 

avatar
New Contributor

no cm installed.

cdh: 2.6.0-cdh5.15.1

 

Failure:

2019-05-03 13:18:50,472 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:etl (auth:SIMPLE) cause:java.net.ConnectException: Call From clientMachine#14/clientMachine#14IP to NameNode:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-05-03 13:18:50,473 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From clientMachine#14/clientMachine#14IP to NameNode:9000 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
	at org.apache.hadoop.ipc.Client.call(Client.java:1508)
	at org.apache.hadoop.ipc.Client.call(Client.java:1441)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:268)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1324)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1311)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1299)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:315)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:280)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:267)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1630)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:339)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:335)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:335)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:784)
	at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:760)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
	at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
	at org.apache.hadoop.ipc.Client.call(Client.java:1480)
	... 31 more

 

Datanode log on the failed client machine#14 around the failed time:

886389:2019-05-03 13:17:36,651 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: clientMachinej#14:50010:DataXceiver error processing WRITE_BLOCK operation  src: /clientIP:36531 dst: /datanode#10IP:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1669201578-10.202.33.149-1549049450582:blk_1155667125_83195343 already exists in state FINALIZED and thus cannot be created.
892139:2019-05-03 13:18:22,989 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(clientIP, datanodeUuid=3e35a5fc-4e9b-4548-8631-ab85b5885007, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-1f8081e6-d4ed-437d-8273-1f2a5edd4057;nsid=338758850;c=0):Exception writing BP-1669201578-10.202.33.149-1549049450582:blk_1155669671_83197947 to mirror datanode#10IP:50010
892140:java.io.IOException: Connection reset by peer
894608:java.io.EOFException: Premature EOF: no length prefix available
894613:2019-05-03 13:18:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671740_83200091
894614:java.io.IOException: Premature EOF from inputStream
894626:2019-05-03 13:18:49,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
894627:java.nio.channels.ClosedByInterruptException
894642:java.nio.channels.ClosedByInterruptException
894657:2019-05-03 13:18:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671740_83200091 received exception java.io.IOException: Premature EOF from inputStream
894659:java.io.IOException: Premature EOF from inputStream
894674:java.io.EOFException: Premature EOF: no length prefix available
894679:2019-05-03 13:18:49,227 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
894680:java.io.IOException: Connection reset by peer
894698:java.io.IOException: Connection reset by peer
894716:2019-05-03 13:18:49,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671508_83199858
894717:java.nio.channels.ClosedByInterruptException
894739:2019-05-03 13:18:49,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671508_83199858 received exception java.nio.channels.ClosedByInterruptException
894741:java.nio.channels.ClosedByInterruptException
894765:2019-05-03 13:18:49,369 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671603_83199953
894766:java.io.IOException: Premature EOF from inputStream
894780:2019-05-03 13:18:49,369 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671603_83199953 received exception java.io.IOException: Premature EOF from inputStream
894782:java.io.IOException: Premature EOF from inputStream
894796:2019-05-03 13:18:49,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1669201578-10.202.33.149-1549049450582:blk_1155671632_83199982
894797:java.io.IOException: Premature EOF from inputStream
894810:java.io.EOFException: Premature EOF: no length prefix available
894817:2019-05-03 13:18:49,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1669201578-10.202.33.149-1549049450582:blk_1155671632_83199982 received exception java.io.IOException: Premature EOF from inputStream
894819:java.io.IOException: Premature EOF from inputStream

There was some GC pause on namenode machine#9, but fixed by increase the Heap Memory.

129,238,586 files and directories, 31,709,865 blocks = 160,948,451 total filesystem object(s).

Heap Memory used 33.34 GB of 101.15 GB Heap Memory. Max Heap Memory is 101.15 GB.

On the machine #9 datanode about the same time:

2019-05-03 13:18:42,146 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671398_83199748, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2019-05-03 13:18:42,148 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671637_83199987, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2019-05-03 13:18:42,149 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1669201578-10.202.33.149-1549049450582:blk_1155671717_83200068 src: /machine#12:33498 dest: /machine#9:50010
2019-05-03 13:18:42,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1096ms (threshold=300ms)
2019-05-03 13:18:42,521 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1163ms (threshold=300ms)
2019-05-03 13:18:54,592 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 508ms (threshold=300ms)
2019-05-03 13:18:55,892 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 14349ms (threshold=300ms)
2019-05-03 13:18:55,734 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 338ms (threshold=300ms)
2019-05-03 13:18:55,404 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 585ms (threshold=300ms)
2019-05-03 13:18:55,352 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 13225ms (threshold=300ms)
2019-05-03 13:18:55,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 501ms (threshold=300ms)
2019-05-03 13:18:55,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 2067ms (threshold=300ms)
2019-05-03 13:18:56,730 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4850ms
No GCs detected
2019-05-03 13:18:55,005 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1025ms (threshold=300ms)
2019-05-03 13:18:54,957 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 850ms (threshold=300ms)
2019-05-03 13:18:54,796 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 12170ms (threshold=300ms)
2019-05-03 13:18:54,634 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 504ms (threshold=300ms)
2019-05-03 13:18:56,107 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 13561ms (threshold=300ms)
2019-05-03 13:18:57,012 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 1512ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155670553_83198851, type=HAS_DOWNSTREAM_IN_PIPELINE, replyAck=seqno: 1704 reply: 0 reply: 0 downstreamAckTimeNanos: 9451006117
2019-05-03 13:18:57,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 392ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671880_83200232, type=LAST_IN_PIPELINE, downstreams=0:[], replyAck=seqno: 67 reply: 0 downstreamAckTimeNanos: 0
2019-05-03 13:18:57,068 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 600ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671512_83199862, type=HAS_DOWNSTREAM_IN_PIPELINE, replyAck=seqno: 19 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 11402140530
2019-05-03 13:18:57,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 2563ms (threshold=300ms), PacketResponder: BP-1669201578-10.202.33.149-1549049450582:blk_1155671711_83200062, type=LAST_IN_PIPELINE, downstreams=0:[], replyAck=seqno: 153 reply: 0 downstreamAckTimeNanos: 0