
Failed to connect to server (port 8032) retries get failed due to exceeded maximum allowed retries number: 0

Rising Star

Hi all,

I am using HDP 2.5. Whenever I run a Spark job or create a Spark context (from a Jupyter notebook or the pyspark shell), I get the following warning:

WARN Client: Failed to connect to server: mycluster.at/111.11.11.11:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:650)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
        at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
        at org.apache.hadoop.ipc.Client.call(Client.java:1449)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy15.getNewApplication(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:221)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy16.getNewApplication(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:225)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:233)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:157)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)

The job then runs fine, but the warning always appears. I have another cluster with HDP 2.4 where I don't see this warning.
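
For reference, this is roughly how I reproduce it (a sketch; details like the exact Spark version on HDP 2.5 may differ on your cluster):

    # launching pyspark in yarn-client mode creates the SparkContext, which
    # asks the ResourceManager (port 8032) for a new application id
    pyspark --master yarn --deploy-mode client

    # the WARN is printed during startup; afterwards jobs run normally, e.g.
    >>> sc.parallelize(range(10)).count()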

Any ideas?

Thanks in advance,

1 ACCEPTED SOLUTION

Super Collaborator

Your standby RM (rm1) must be the first RM in the configured list of RMs, so it is tried first, and the refused connection produces this warning before the client fails over to the active RM (rm2).
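
For example, the order comes from yarn.resourcemanager.ha.rm-ids in yarn-site.xml; the YARN client tries the RM ids in the order they are listed (the values below are illustrative, with the hostname taken from the warning above):

    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <property>
      <!-- 8032 is the default RM client port; the hostname is a placeholder -->
      <name>yarn.resourcemanager.address.rm1</name>
      <value>mycluster.at:8032</value>
    </property>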


4 REPLIES

Master Guru

@Jose Molero

Do you have ResourceManager HA configured? Looking at this error, it seems your rm1 is in standby and rm2 works fine.

Can you please check?
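
For example, assuming your RM ids are rm1 and rm2, you can check each one's HA state with:

    # prints "active" or "standby" for each ResourceManager id
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2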

Rising Star

Hi @Kuldeep Kulkarni

Yes, ResourceManager HA is configured, and both are working fine; it's just that rm1 is in standby mode and rm2 is active.

Super Collaborator

Your standby RM (rm1) must be the first RM in the configured list of RMs, so it is tried first, and the refused connection produces this warning before the client fails over to the active RM (rm2).
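
If the warning clutters your logs, one common workaround (a generic log4j setting, not something specific to HDP) is to raise the log level for the Hadoop IPC client in Spark's conf/log4j.properties:

    # silence connection-retry warnings from the Hadoop IPC client
    log4j.logger.org.apache.hadoop.ipc.Client=ERROR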

Rising Star

Hi @bikas, OK, I understand. So I shouldn't worry about it. Thanks!