Support Questions

Find answers, ask questions, and share your expertise

MapReduce client retrying to connect after job completion

avatar
Explorer

Running on Hadoop 2.6.0-cdh5.7.0 and issuing a simple Pig script.
After a successful job completion I'm getting the following message :

Seems like the workers are trying to communicate with each other (with a maximum of 3 retries) but I'm not sure why, and where this behavior is configured.

Does anyone know how to solve this issue ?

 

Counters:
Total records written : 46933
Total bytes written : 12822705
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1469941650260_0002  ->      job_1469941650260_0011,
job_1469941650260_0003  ->      job_1469941650260_0011,
job_1469941650260_0001  ->      job_1469941650260_0005,job_1469941650260_0006,
job_1469941650260_0005  ->      job_1469941650260_0006,
job_1469941650260_0006  ->      job_1469941650260_0007,
job_1469941650260_0007  ->      job_1469941650260_0008,job_1469941650260_0009,
job_1469941650260_0004  ->      job_1469941650260_0008,
job_1469941650260_0008  ->      job_1469941650260_0010,
job_1469941650260_0010  ->      job_1469941650260_0011,
job_1469941650260_0009  ->      job_1469941650260_0011,
job_1469941650260_0011


2016-07-31 05:28:54,418 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:55,419 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:56,420 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p1.c.project.internal/10.240.0.22:38762. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:56,527 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:28:57,626 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:58,628 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:59,629 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:35325. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:28:59,732 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:00,833 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:01,834 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:02,835 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:45573. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:02,939 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:04,051 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:05,052 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:06,053 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:36934. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:06,157 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:07,244 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:08,245 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:09,246 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker2.c.project.internal/10.240.0.24:43862. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:09,350 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:10,643 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:11,644 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:12,645 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:38481. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:12,749 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:13,832 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:14,833 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:15,834 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker-p2.c.project.internal/10.240.0.17:34431. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:15,937 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:17,045 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:18,046 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:19,047 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker1.c.project.internal/10.240.0.27:38757. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:19,149 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:20,230 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:21,231 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:22,232 [MainThread] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: cdh-worker3.c.project.internal/10.240.0.25:37952. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2016-07-31 05:29:22,335 [MainThread] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-31 05:29:22,417 [MainThread] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
1 ACCEPTED SOLUTION

avatar
Mentor
This is standard behaviour currently. The retries of 3 times to reach an MR2 AM is controlled via "yarn.app.mapreduce.client-am.ipc.max-retries".

If the client cannot reach the MR2 AM, which is the definitive source of truth during application execution, then it falls back to asking the RM which may answer the state and the redirection if the application has completed.

View solution in original post

2 REPLIES 2

avatar
Mentor
This is standard behaviour currently. The retries of 3 times to reach an MR2 AM is controlled via "yarn.app.mapreduce.client-am.ipc.max-retries".

If the client cannot reach the MR2 AM, which is the definitive source of truth during application execution, then it falls back to asking the RM which may answer the state and the redirection if the application has completed.

avatar
Explorer

Thanks, good to know.