Reply
Explorer
Posts: 8
Registered: ‎05-31-2016
Accepted Solution

MapReduce client retrying to connect after job completion

Running on Hadoop 2.6.0-cdh5.7.0 and issuing a simple Pig script.
After a successful job completion I'm getting the following message :

Seems like the workers are trying to communicate with each other (with a maximum of 3 retries) but I'm not sure why, and where this behavior is configured.

Does anyone know how to solve this issue ?

 

Counters:
Total records written : 46933
Total bytes written : 12822705
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1469941650260_0002  ->      job_1469941650260_0011,
job_1469941650260_0003  ->      job_1469941650260_0011,
job_1469941650260_0001  ->      job_1469941650260_0005,job_1469941650260_0006,
job_1469941650260_0005  ->      job_1469941650260_0006,
job_1469941650260_0006  ->      job_1469941650260_0007,
job_1469941650260_0007  ->      job_1469941650260_0008,job_1469941650260_0009,
job_1469941650260_0004  ->      job_1469941650260_0008,
job_1469941650260_0008  ->      job_1469941650260_0010,
job_1469941650260_0010  ->      job_1469941650260_0011,
job_1469941650260_0009  ->      job_1469941650260_0011,
job_1469941650260_0011


2016-07-31 05:28:54,418 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:55,419 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:56,420 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:56,527 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:28:57,626 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:58,628 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:59,629 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:59,732 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:00,833 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:01,834 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:02,835 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:02,939 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:04,051 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:05,052 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:06,053 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:06,157 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:07,244 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:08,245 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:09,246 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:09,350 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:10,643 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:11,644 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:12,645 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:12,749 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:13,832 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:14,833 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:15,834 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:15,937 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:17,045 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:18,046 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:19,047 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:19,149 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:20,230 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:21,231 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:22,232 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:22,335 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:22,417 [MainThread] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Posts: 1,903
Kudos: 435
Solutions: 307
Registered: ‎07-31-2013

Re: MapReduce client retrying to connect after job completion

This is standard behaviour currently. The retries of 3 times to reach an MR2 AM is controlled via "yarn.app.mapreduce.client-am.ipc.max-retries".

If the client cannot reach the MR2 AM, which is the definitive source of truth during application execution, then it falls back to asking the RM which may answer the state and the redirection if the application has completed.
Highlighted
Explorer
Posts: 8
Registered: ‎05-31-2016

Re: MapReduce client retrying to connect after job completion

Thanks, good to know.