Yarn client takes too long to start on NN HA cluster when nn1 is down

In a NameNode HA cluster (HDP-2.5.0), when nn1 is down and nn2 is the Active NN, the Yarn client retries connecting to nn1 500 times at 2000 ms intervals (1000 sec = 16 min 40 sec). Only after that does it finally go to nn2 and the Yarn application starts. I wonder why the default is so long. It means that if nn1 is down, ALL Yarn applications are delayed by about 17 minutes at startup! [When nn2 is down and nn1 is the Active NN, everything is fine.] I agree that failover shouldn't be attempted too soon, but 17 minutes sounds way too long. Below is a snippet from running the pi example.

The other thing is how to change the default. There are two properties in yarn-site.xml, initially set to "2000, 500":

yarn.resourcemanager.fs.state-store.retry-policy-spec
yarn.node-labels.fs-store.retry-policy-spec 
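
For reference, the entries look roughly like this in yarn-site.xml (a sketch; the value is a "sleep-ms, number-of-retries" pair, matching the 500x2000ms seen in the log below):

```xml
<!-- yarn-site.xml: sketch of the two retry-policy properties.
     Value format: "sleep-ms, number-of-retries" -->
<property>
  <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
  <value>2000, 500</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.retry-policy-spec</name>
  <value>2000, 500</value>
</property>
```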

Changing either one, for example to "2000, 10" or any other value, has no effect! The app still does the 500x2000ms retries. Any idea how to change this to more reasonable numbers? Here is that pi with 10 mappers taking 18 minutes(!):

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 1000
Number of Maps  = 10
Samples per Map = 1000
17/03/05 07:31:54 WARN ipc.Client: Failed to connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020: try once and fail.
java.net.ConnectException: Connection refused
...
Wrote input for Map #0
...
Wrote input for Map #9
Starting Job
17/03/05 07:31:55 INFO impl.TimelineClientImpl: Timeline service address: http://pm2502.pqr-hdp.com:8188/ws/v1/timeline/
17/03/05 07:31:56 INFO client.AHSProxy: Connecting to Application History server at pm2502.pqr-hdp.com/192.168.121.178:10200
17/03/05 07:31:58 INFO ipc.Client: Retrying connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
17/03/05 07:31:59 INFO ipc.Client: Retrying connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
17/03/05 07:32:01 INFO ipc.Client: Retrying connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020. Already tried 2 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
...
17/03/05 07:48:54 INFO ipc.Client: Retrying connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020. Already tried 499 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
17/03/05 07:48:56 INFO ipc.Client: Retrying connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020. Already tried 500 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
17/03/05 07:48:56 WARN ipc.Client: Failed to connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020: Retry all pairs in MultipleLinearRandomRetry: [500x2000ms]
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
...
17/03/05 07:48:57 INFO input.FileInputFormat: Total input paths to process : 10
17/03/05 07:48:57 INFO mapreduce.JobSubmitter: number of splits:10
17/03/05 07:48:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1488515741091_0005
17/03/05 07:48:57 WARN ipc.Client: Failed to connect to server: pm2501.pqr-hdp.com/192.168.121.177:8020: try once and fail.
java.net.ConnectException: Connection refused
...
17/03/05 07:48:58 INFO impl.YarnClientImpl: Submitted application application_1488515741091_0005
17/03/05 07:48:58 INFO mapreduce.Job: The url to track the job: http://pm2501.pqr-hdp.com:8088/proxy/application_1488515741091_0005/
17/03/05 07:48:58 INFO mapreduce.Job: Running job: job_1488515741091_0005
17/03/05 07:49:05 INFO mapreduce.Job: Job job_1488515741091_0005 running in uber mode : false
17/03/05 07:49:05 INFO mapreduce.Job:  map 0% reduce 0%
17/03/05 07:49:11 INFO mapreduce.Job:  map 50% reduce 0%
17/03/05 07:49:13 INFO mapreduce.Job:  map 90% reduce 0%
17/03/05 07:49:15 INFO mapreduce.Job:  map 100% reduce 0%
17/03/05 07:49:19 INFO mapreduce.Job:  map 100% reduce 100%
...

Re: Yarn client takes too long to start on NN HA cluster when nn1 is down

Rising Star

@Predrag Minovic, can you check the values of the following two properties in your cluster?

yarn.client.nodemanager-connect.max-wait-ms
yarn.client.nodemanager-connect.retry-interval-ms
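
A quick way to check what the client actually sees is to grep the values out of yarn-site.xml. This is just a sketch: on a real client node, point CONF at /etc/hadoop/conf/yarn-site.xml (standard HDP layout); the sample file below only makes the snippet self-contained.

```shell
# Sketch: pull property values out of a yarn-site.xml.
# On a real node: CONF=/etc/hadoop/conf/yarn-site.xml
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>60000</value>
  </property>
  <property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>
</configuration>
EOF
for key in yarn.client.nodemanager-connect.max-wait-ms \
           yarn.client.nodemanager-connect.retry-interval-ms; do
  # print the <value> line that follows the matching <name> line
  val=$(grep -A1 "<name>$key</name>" "$CONF" | grep -o '[0-9][0-9]*')
  echo "$key = $val"
done
```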

Re: Yarn client takes too long to start on NN HA cluster when nn1 is down

They are 60,000 ms and 10,000 ms, but it doesn't look like they matter.

Re: Yarn client takes too long to start on NN HA cluster when nn1 is down

Rising Star

Right, my bad.

I think this post is about the same (or a similar) issue, and there is a workaround in there:

https://community.hortonworks.com/questions/9586/spark-in-yarn-with-namenode-ha.html

Re: Yarn client takes too long to start on NN HA cluster when nn1 is down

Rising Star

Also, it seems that using org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider as dfs.client.failover.proxy.provider.[nameservice ID] is another possible option. There is some documentation on it here:

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.ht...
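
For reference, the setting would look something like this in hdfs-site.xml (a sketch; "mycluster" stands in for your actual nameservice ID):

```xml
<!-- hdfs-site.xml: sketch, assuming the nameservice is named "mycluster".
     RequestHedgingProxyProvider contacts both NameNodes in parallel
     and uses whichever answers first, instead of trying them in order. -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider</value>
</property>
```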

Re: Yarn client takes too long to start on NN HA cluster when nn1 is down

New Contributor

Hi Predrag, did you find a solution? I am having the same issue.

Thanks, Dirk
