Active master not accepting new applications if one of the masters added to zookeeper is down

Hi Team,

I am facing a strange issue while enabling high availability (HA) in a Spark standalone cluster.

I have configured 3 Spark masters and registered them in ZooKeeper with the following steps:

1. Create a configuration file ha.conf with the following content:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=ZK_HOST:2181
spark.deploy.zookeeper.dir=/spark

2. Start all 3 masters, passing this properties file as an argument to the start-master script:

./start-master.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf

This way all 3 Spark masters started and registered in ZooKeeper.

Working: if I kill the active master, all running applications are picked up by the new active master.

Not working: if any one Spark master (e.g. localhost:17077) is down and I submit an application with the command below:

./bin/spark-submit --class WordCount --master spark://localhost:17077,h2:27077,h3:37077 --deploy-mode cluster --conf spark.cores.max=1 ~/TestSpark-0.0.1-SNAPSHOT.jar /user1/test.txt

Ideally the submission should go to the active master and succeed, because only one master is down and the others are running. Instead I get this exception:

Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
    at org.apache.spark.deploy.Client$anonfun$7.apply(Client.scala:230)
    at org.apache.spark.deploy.Client$anonfun$7.apply(Client.scala:230)
    at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.deploy.Client$.main(Client.scala:230)
    at org.apache.spark.deploy.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:17077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
        at org.apache.spark.rpc.netty.Outbox$anon$1.call(Outbox.scala:191)
        at org.apache.spark.rpc.netty.Outbox$anon$1.call(Outbox.scala:187)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:17077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more

Any help, clue, or suggestion is appreciated. Please help me understand this; I have searched for similar problems but could not find anything.

NOTE:

I face this problem only when I submit the application in cluster mode; there is no problem when I submit it in client mode.
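
The stack trace above shows Client.main mapping over every master in the list and calling setupEndpointRef for each one (Client.scala:230), so a single refused connection aborts the whole submission. One way to see which masters the legacy submission would trip over is to probe each port before submitting. This is only a minimal sketch: the function names are hypothetical, and is_reachable assumes that bash and the coreutils timeout command are installed.

```shell
# list_masters: split "spark://h1:p1,h2:p2" into one "host port" pair per line.
list_masters() {
  # Strip the scheme, turn commas into newlines, then split each entry on ':'.
  printf '%s\n' "${1#spark://}" | tr ',' '\n' | while IFS=: read -r host port; do
    printf '%s %s\n' "$host" "$port"
  done
}

# is_reachable: succeed if a TCP connection to host:port opens within 2 seconds.
# Assumption: bash (for /dev/tcp) and coreutils timeout are available.
is_reachable() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_masters: print the reachability of every master in the URL.
report_masters() {
  list_masters "$1" | while read -r host port; do
    if is_reachable "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port UNREACHABLE"
    fi
  done
}

# Example: report_masters spark://localhost:17077,h2:27077,h3:37077
```

A master reported as UNREACHABLE here is one that would produce the "Connection refused" seen in the trace.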


Mentor

sachin gupta

I think you need a ZooKeeper quorum of at least 3 nodes.

Can you have a look at this ha-setup?

Please let me know.

@Geoffrey Shelton Okot

I set up a 3-node ZooKeeper quorum, but it still did not work. I do not understand why spark-submit contacts all the Spark masters and throws connection refused if any one of them is down.

Please give me some more pointers.
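
For reference, pointing the masters at a three-node ZooKeeper quorum only changes the spark.deploy.zookeeper.url line of ha.conf, which accepts a comma-separated host:port list. A minimal sketch that prints such a file follows; the function name and the hosts zk1, zk2, zk3 are hypothetical placeholders, not the hosts used in this thread.

```shell
# print_ha_conf: print an ha.conf for the given ZooKeeper connection string.
# $1 is a comma-separated list of host:port pairs.
print_ha_conf() {
  printf 'spark.deploy.recoveryMode=ZOOKEEPER\n'
  printf 'spark.deploy.zookeeper.url=%s\n' "$1"
  printf 'spark.deploy.zookeeper.dir=/spark\n'
}

# Example (zk1, zk2, zk3 are placeholder hosts):
#   print_ha_conf zk1:2181,zk2:2181,zk3:2181 > ha.conf
```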

Hi @Geoffrey Shelton Okot, do you have any other solution for this?

Mentor

@sachin gupta

To better understand your situation I will need to reproduce it. Do you have any document you created for this setup? If so, can you share it? I fail to understand how HA fails when the 3 Spark masters are registered with ZooKeeper and your conf file is correct.

Can you send me the output of ls / from the ZooKeeper client?

# bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 0] ls /
[hive, registry, cluster, controller, brokers, zookeeper, infra-solr, hbase-unsecure, kafka-acl, kafka-acl-changes, admin, isr_change_notification, templeton-hadoop, hiveserver2, controller_epoch, druid, rmstore, hbase-secure, ambari-metrics-cluster, consumers, config]

Please get back to me.

@Geoffrey Shelton Okot

The output of ls / in the ZooKeeper client is [opt, zookeeper, spark].

Please see this link for the steps I took to start the masters in HA mode. Let me know if any step is unclear.

The application can be submitted to the Spark REST server, which runs on port 6066, rather than to the legacy submission endpoint, which runs on port 7077.

With this, if one Spark master is down, the application is submitted to another Spark master.

So the issue was fixed by submitting the application to the REST server with the command below:

./bin/spark-submit --class WordCount --master spark://localhost:6066,h2:6066,h3:6066 --deploy-mode cluster --conf spark.cores.max=1 ~/TestSpark-0.0.1-SNAPSHOT.jar /user1/test.txt
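
The REST master URL in the command above can be assembled from a list of master hosts. The helper below is a minimal sketch with a hypothetical function name; it assumes that every master runs its REST server on the same port, 6066 as in this thread. Depending on the Spark version, the REST server may have to be enabled explicitly on the masters (spark.master.rest.enabled).

```shell
# rest_master_url: build "spark://h1:PORT,h2:PORT,..." from a list of hosts.
# REST_PORT overrides the port; 6066 is the port used in this thread.
rest_master_url() {
  port="${REST_PORT:-6066}"
  url=""
  for host in "$@"; do
    # Prefix a comma only when url already holds an earlier entry.
    url="${url:+$url,}$host:$port"
  done
  printf 'spark://%s\n' "$url"
}

# Example:
#   ./bin/spark-submit --master "$(rest_master_url localhost h2 h3)" --deploy-mode cluster ...
```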