Spark YARN cluster + Windows client, deploy-mode=client, SparkException: Failed to connect to driver

Contributor

I have an HDP 3.0.0.0 cluster. All machines in the cluster run Ubuntu 16.04.

I want to make a Windows machine able to connect to the cluster and run Spark jobs on it.

So far I've managed to submit Spark jobs to the cluster via `spark-submit --deploy-mode cluster --master yarn`.
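
For reference, a full cluster-mode submission along those lines looks roughly like this (the script path and resource settings are just placeholders):

PS > spark-submit --master yarn --deploy-mode cluster --name clusterTest --num-executors 2 --executor-memory 2g C:\path\to\example_job.py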

I'm having trouble running the `pyspark` interactive shell with `--deploy-mode client`, which, to my understanding, creates the driver process on the Windows machine. Right now, when I run `pyspark` in a Windows command-line console (specifically PowerShell), it always fails with the following output:

PS > pyspark --name pysparkTest8
Python 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-08-21 18:27:10 WARN  DomainSocketFactory:117 - The short-circuit local reads feature cannot be used because UNIX Domain sockets are not available on Windows.
2018-08-21 18:40:48 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
2018-08-21 18:40:48 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Attempted to request executors before the AM has registered!
2018-08-21 18:40:48 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is not running
2018-08-21 18:40:48 WARN  SparkContext:66 - Another SparkContext is being constructed (or threw an exception in its constructor).
 This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
2018-08-21 18:54:07 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
2018-08-21 18:54:07 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Attempted to request executors before the AM has registered!
2018-08-21 18:54:07 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is not running
Traceback (most recent call last):
  File "C:\\python\pyspark\shell.py", line 54, in 
    spark = SparkSession.builder.getOrCreate()
  File "C:\\python\pyspark\sql\session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "C:\\python\pyspark\context.py", line 343, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "C:\\python\pyspark\context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "C:\\python\pyspark\context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "C:\\python\pyspark\context.py", line 282, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "C:\\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1525, in _
_call__
  File "C:\\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_re
turn_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

When I look at the YARN application logs, there's something worth noting in stderr:

Log Type: stderr
Log Upload Time: Tue Aug 21 18:50:14 -0700 2018
Log Length: 3774
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.0.0.0-1634/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/08/21 18:36:41 INFO util.SignalUtils: Registered signal handler for TERM
18/08/21 18:36:41 INFO util.SignalUtils: Registered signal handler for HUP
18/08/21 18:36:41 INFO util.SignalUtils: Registered signal handler for INT
18/08/21 18:36:41 INFO spark.SecurityManager: Changing view acls to: yarn,myusername
18/08/21 18:36:41 INFO spark.SecurityManager: Changing modify acls to: yarn,myusername
18/08/21 18:36:41 INFO spark.SecurityManager: Changing view acls groups to: 
18/08/21 18:36:41 INFO spark.SecurityManager: Changing modify acls groups to: 
18/08/21 18:36:41 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, myusername); groups with view permissions: Set(); users  with modify permissions: Set(yarn, myusername); groups with modify permissions: Set()
18/08/21 18:36:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/21 18:36:42 INFO yarn.ApplicationMaster: Preparing Local resources
18/08/21 18:36:43 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/08/21 18:36:43 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1534303777268_0044_000001
18/08/21 18:36:44 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
18/08/21 18:38:51 ERROR yarn.ApplicationMaster: Failed to connect to driver at Windows-client-hostname:50000, retrying ...
18/08/21 18:38:51 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:672)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:532)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:347)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:869)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
18/08/21 18:38:51 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
18/08/21 18:38:51 INFO util.ShutdownHookManager: Shutdown hook called

My suspicion is that the Windows client machine's firewall is blocking port 50000, because when I run telnet from one of the Ubuntu machines, I get "Connection timed out":

telnet windows-client-hostname 50000
Trying 10.100.1.61...
telnet: Unable to connect to remote host: Connection timed out

But I have specifically allowed ports 1025-65535 in the Inbound Rules of Windows Firewall with Advanced Security (the client runs Windows Server 2012 R2).
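
For what it's worth, the PowerShell equivalent of the rule I added through the GUI would be something like this (the rule name is just a label I chose):

PS > New-NetFirewallRule -DisplayName "Spark driver ports" -Direction Inbound -Protocol TCP -LocalPort "1025-65535" -Action Allow
PS > Get-NetFirewallRule -DisplayName "Spark driver ports" | Get-NetFirewallPortFilter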

I have configured `spark.port.maxRetries` as suggested in this post, but it didn't change anything. My `spark-defaults.conf` on the Windows client machine looks like this:

spark.master                     yarn
spark.yarn.am.memory             4g
spark.executor.memory            5g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize       10g
spark.driver.memory              5g
spark.yarn.archive               hdfs:///hdp/apps/3.0.0.0-1634/spark2/spark2-hdp-yarn-archive.tar.gz
spark.port.maxRetries            100
spark.driver.port                50000
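
(These settings can also be passed per session on the command line, for example:)

PS > pyspark --master yarn --deploy-mode client --conf spark.driver.port=50000 --conf spark.port.maxRetries=100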

At this point I am totally confused. Can someone give some hints on how to tackle this?

Thank you very much!

3 REPLIES


@Guozhen Li In YARN client mode the client machine (the Windows machine here) needs network access to all of the cluster worker nodes, since the AM and executors could potentially run on any of them, and vice versa: the executors must be able to connect back to the driver running on the Windows client machine. I think you are right that this may be due to a firewall or network problem.
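
As a quick sanity check, you could test reachability in both directions, e.g. from the Windows client toward the ResourceManager and a NodeManager, in addition to the telnet test you already ran from a worker node toward the driver port. The hostnames below are placeholders and the ports are common defaults (8032 for the ResourceManager, 45454 for the NodeManager on HDP); adjust them to your cluster config:

PS > Test-NetConnection -ComputerName resourcemanager-hostname -Port 8032
PS > Test-NetConnection -ComputerName worker-node-hostname -Port 45454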

HTH


Contributor

Hey, thanks Felix. I figured out it was actually neither Spark nor the firewall; it was an extra network adapter created by VirtualBox.

Contributor

I figured out what went wrong... It actually had nothing to do with Spark or Windows Firewall, but with VirtualBox.

My Windows machine has VirtualBox installed and hosts a guest VM. VirtualBox creates a network adapter called something like "VirtualBox Host-Only Network", which has a different IP address than the machine's actual network adapter.

In my case, the actual network adapter is on the LAN with IP address 10.100.1.61, while the VirtualBox Host-Only Network has IP address 192.168.56.1.

I solved the issue by disabling the VirtualBox Host-Only Network in Control Panel >> Network and Internet >> Network Connections.
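
(For the record, the adapter can also be disabled from PowerShell; something like the following should work, assuming the adapter shows up under that name:)

PS > Get-NetAdapter | Where-Object { $_.InterfaceDescription -like "*VirtualBox*" }
PS > Disable-NetAdapter -Name "VirtualBox Host-Only Network" -Confirm:$false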

I found this by first running `pyspark` in PowerShell and then running `netstat -an | Select-String 50000`, which showed something listening on 192.168.56.1:50000:

PS > netstat -an | sls 50000
TCP    192.168.56.1:50000     0.0.0.0:0              LISTENING
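
In hindsight, an alternative to disabling the adapter might have been to pin the driver to the LAN address in `spark-defaults.conf`, along these lines (I haven't tested this in my setup):

spark.driver.host                10.100.1.61
spark.driver.bindAddress         10.100.1.61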