Support Questions

AppMaster failed to launch container on an alternate NodeManager after a connection timeout on one NM.

Rising Star

Dear friends,

I need some help finding the root cause of this issue.

During a Sqoop job failure, we noticed that the AppMaster was unable to connect to a NodeManager due to connection timeouts, and it kept retrying the connection for close to 2 hours, until it was killed manually.
The timeout was caused by a temporary network issue between the AppMaster and the NodeManager.

Here is an overview of what happened:

RM <-----> NM01(hdpn01) Network ok
RM <-----> NM08(hdpn08) Network ok
NM01 <---X---> NM08 Network failed

The AppMaster container was launched on node NM01.

Here is a brief log:

INFO [RMCommunicator Allocator] Before Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2018-02-03 21:12:51,734 INFO [RMCommunicator Allocator] getResources() for application_1517675224254_1052: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:2776576, vCores:1> knownNMs=24
2018-02-03 21:12:52,751 INFO [RMCommunicator Allocator] Got allocated containers 1
2018-02-03 21:12:52,793 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.util.RackResolver: Resolved to /default-rack
2018-02-03 21:12:52,797 INFO [RMCommunicator Allocator] Assigned container container_1517675224254_1052_02_000002 to attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:12:52,799 INFO [RMCommunicator Allocator] After Scheduling: PendingReds:0 ScheduledMaps:0 Sc
2018-02-03 21:24:04,379 INFO [AsyncDispatcher event handler] attempt_1517675224254_1052_m_000000_1000 TaskAttempt Transitioned from ASSIGNED to KILL_CONTAINER_CLEANUP
2018-02-03 21:24:04,380 INFO [AsyncDispatcher event handler] Diagnostics report from attempt_1517675224254_1052_m_000000_1000: Container expired since it was unused
2018-02-03 21:24:04,381 INFO [ContainerLauncher #1] Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1517675224254_1052_02_000002 taskAttempt attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:24:53,616 WARN [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Failed to connect to server: retries get failed due to exceeded maximum allowed retries number: 0 Connection timed out
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(
    at org.apache.hadoop.ipc.Clien


2018-02-03 22:43:58,911 WARN [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Failed to connect to server: retries get failed due to exceeded maximum allowed retries number: 0 Connection timed out

Why did the AM keep retrying the connection to the NM on hdpn08 for 2 hours, until it was manually killed? If it hadn't been killed, it would have continued for much longer.

Why did the AM not stop trying after x number of tries? Is there a maximum-attempts property for the application master?

Why did the AM not spin up another map task to compensate for the problematic one?


Rising Star

In this case the source host (the AppMaster) was unable to receive any response from the destination: this was a faulty-routing issue, and the packets were simply lost. I checked the source code of org.apache.hadoop.ipc.Client and found that it uses an RPC-level ping to check for a response from the destination host, and keeps trying until it receives a response. This is clearly stated in the Client class's Javadoc: "This class sends a ping to the remote side when timeout on reading. If no failure is detected, it retries until at least a byte is read." So, due to the routing issue, it kept retrying until the job was killed.
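The retry-forever behavior described above is governed by the IPC client's ping settings. As a sketch, the following core-site.xml fragment shows the standard Hadoop IPC client properties involved (these property names come from core-default.xml; the values shown are illustrative, not a recommendation, and you should verify defaults against your Hadoop version before changing anything):

```xml
<configuration>
  <!-- When true (the default), the IPC client sends a ping on read
       timeout and retries indefinitely instead of failing. Setting
       this to false makes a read timeout fatal to the call. -->
  <property>
    <name>ipc.client.ping</name>
    <value>true</value>
  </property>

  <!-- Interval (ms) between pings while waiting on a read; this is
       also the effective read-timeout granularity. Default 60000. -->
  <property>
    <name>ipc.ping.interval</name>
    <value>60000</value>
  </property>

  <!-- Bound on retries when the initial TCP connect itself times
       out (as in the log above). Default 45, with a 20000 ms
       connect timeout per attempt (ipc.client.connect.timeout). -->
  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>45</value>
  </property>
</configuration>
```

With `ipc.client.ping` left at its default of `true`, an established connection that stops receiving data is pinged and retried without an upper bound, which matches the roughly 2-hour retry loop seen here.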

Thanks to for providing an easy way to read the source code.