Dear friends,
Need some help to know root cause of this issue.
In
a sqoop job failures, it has been noticed that the app master wasn't
able to connect to a NM due to connection time out issues and it kept on
retrying the connection for close to 2 hrs, until killed manually.
The timeout was due a temporary network issue between appmaster and a nodemanager.
Here is overview of what happend:
RM <-----> NM01(hdpn01) Network ok
RM <-----> NM08(hdpn08) Network ok
NM01 <---X---> NM08 Network failed
AppMaster container launched at NM01 node.
Here is brief log:
INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0
AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0
ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2018-02-03 21:12:51,734 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1517675224254_1052: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:2776576, vCores:1> knownNMs=24
2018-02-03 21:12:52,751 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2018-02-03 21:12:52,793 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.util.RackResolver: Resolved
hdpn08.ztpl.net to /default-rack
2018-02-03 21:12:52,797 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1517675224254_1052_02_000002 to attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:12:52,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 Sc
2018-02-03 21:24:04,379 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1517675224254_1052_m_000000_1000 TaskAttempt Transitioned from ASSIGNED to KILL_CONTAINER_CLEANUP
2018-02-03 21:24:04,380 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1517675224254_1052_m_000000_1000: Container expired since it was unused
2018-02-03 21:24:04,381 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1517675224254_1052_02_000002 taskAttempt attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:24:53,616 WARN [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Failed to connect to server:
hdpn08.ztpl.net/172.20.1.108:45454: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
at org.apache.hadoop.ipc.Clien
......................
2018-02-03 22:43:58,911 WARN [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Failed to connect to server:
hdpn08.ztpl.net/172.20.1.108:45454: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
Why
did AM keep retrying the connection to the NM on hdpn08 for 2 hrs. till
the time it was manually killed? If it wasn’t killed it would have
continued for much longer.
Why did AM not stop trying after x number of tries? Is there any max attempt properties for application master?
Why did AM not spin out another map task to compensate for this problematic task?