map/reduce task connection timeout with application master

Recently, a few jobs have been failing at random every day because map/reduce tasks time out while connecting to the application master.

1618560991643.jpg

The jobs below all failed:

1618560833333.jpg

The log from one of the containers is below:

2021-04-16 14:29:19,845 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2021-04-16 14:29:19,895 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2021-04-16 14:29:19,895 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2021-04-16 14:29:19,896 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2021-04-16 14:29:19,896 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1618548626214_0723, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@732c2a62)
2021-04-16 14:29:20,071 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2021-04-16 14:29:24,116 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:28,116 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:32,117 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:36,117 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:40,117 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:44,118 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:48,120 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:52,120 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:29:56,121 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:30:00,120 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-17/10.39.58.15:1872. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-16 14:30:03,122 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From dataware-11/10.39.58.22 to dataware-17:1872 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
	at org.apache.hadoop.ipc.Client.call(Client.java:1508)
	at org.apache.hadoop.ipc.Client.call(Client.java:1441)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:246)
	at com.sun.proxy.$Proxy9.getTask(Unknown Source)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:132)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
	at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
	at org.apache.hadoop.ipc.Client.call(Client.java:1480)
	... 4 more

The most important log line is:

Exception running child : java.net.ConnectException: Call From dataware-11/10.39.58.22 to dataware-17:1872 failed on connection exception:

dataware-17 is running the application master and dataware-11 is running the task. I have already spent two days on this case but can't find the root cause. Could you please give me some clues on how to solve this kind of problem?
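
One check that might help narrow this down is a plain TCP probe from the task host to the application master, since "Connection timed out" (no reply at all) usually points in a different direction than "Connection refused" (nothing listening on the port). The following is only a minimal sketch, not a Hadoop tool: `probe` is a hypothetical helper name, and the host and port are taken from the log above. The application master port is typically assigned per application attempt, so it is an assumption that the same port is still in use when the probe runs.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a single TCP connection and classify the outcome.

    Returns "open", "timed out", "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: packets are probably being dropped
        # somewhere (firewall, routing, overloaded host).
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Calling `probe("dataware-17", 1872)` and `probe("10.39.58.15", 1872)` on dataware-11 while the application master is still running would show whether the host name and the IP address behave the same way, and whether the failure is a timeout or a refusal.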