Spark Jobs failing - firewall issue....?

Contributor

I have just rolled out a Hadoop/Spark cluster in an effort to kick-start a data science program at my company. I used Ambari as the manager and installed the Hortonworks distribution (HDFS 2.7.3, Hive 1.2.1, Spark 2.1.1, as well as the other required services). By the way, I am running RHEL 7. I have 2 name nodes, 10 data nodes, 1 Hive node and 1 management node (Ambari).

I built a list of firewall ports based on Apache and Ambari documentation and had my infrastructure guys push those rules. I ran into an issue with Spark wanting to pick random ports. When I attempted to run a Spark job (the traditional Pi example), it would fail, as I did not have the whole ephemeral port range open. Since we will probably be running multiple jobs, it makes sense to let Spark handle this and just choose from the ephemeral range of ports (1024 - 65535) rather than specifying a single port. I know I can pick a range, but to make it easy I just asked my guys to open the whole ephemeral range. At first my infrastructure guys balked at that, but when I told them the purpose, they went ahead and did so.
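
For reference, the rule I asked the infrastructure team for was essentially this, run on every node (this assumes firewalld's default public zone; your zone may differ):

sudo firewall-cmd --permanent --zone=public --add-port=1024-65535/tcp
sudo firewall-cmd --reload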

Based on that, I thought I had my issue fixed, but when I attempt to run a job, it still fails with:

Log Type: stderr
Log Upload Time: Thu Oct 12 11:31:01 -0700 2017
Log Length: 14617
Showing 4096 bytes of 14617 total.

Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:52 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:53 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:54 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:55 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:56 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:59 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:00 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:01 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:02 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:03 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:04 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:05 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:07 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:09 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:10 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:11 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:12 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:13 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:14 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:607)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:461)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:283)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:783)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:781)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:804)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
17/10/12 11:29:15 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
17/10/12 11:29:15 INFO ShutdownHookManager: Shutdown hook called

At first I thought maybe I had some sort of misconfiguration with Spark and the namenodes/datanodes. However, to test it, I simply stopped firewalld on every node and attempted the job again and it worked just fine.
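
For anyone else wanting to run the same test, the procedure was roughly this (the examples jar path is from my HDP install and may differ on yours):

# temporarily stop the firewall on every node (testing only)
sudo systemctl stop firewalld

# re-run the Pi example against YARN
spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 100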

So, my question: I have the entire 1024-65535 port range open, and I can see the Spark driver is trying to connect on those high ports (as shown above, in the 30k-40k range). However, for some reason it fails when the firewall is on and works when it's off. I checked the firewall rules and, sure enough, the ports are open. Those rules are working, too, since I can access the web services for Ambari, YARN and HDFS, which are specified in the same firewalld XML rules file.

I am new to Hadoop/Spark, so I am wondering: is there something I am missing? Is there some port below 1024 I need to account for? Here is the list of ports below 1024 I have open, in addition to the 1024-65535 range:

88
111
443
1004
1006
1019

It's quite possible I missed a lower-numbered port that I really need and just don't know it. Above that, everything else should be covered by the 1024-65535 range. Thank you in advance.


6 REPLIES

Contributor

So, something one of my security guys just mentioned: even though I have the ports open, how does the worker node know which port to listen on? He's got me convinced that even though the port range is open, something still needs to be listening on those ports when the job tries to reach the Spark driver. This seems logical, especially since it works when I turn off the firewall service. Am I going down the wrong path here?

Contributor

I checked on the incoming/outgoing configuration, and that is not an issue. We don't configure things to that level; we either open the port or we don't. I found another post that seems similar to my issue at http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-Y... . I think what I am facing is that, on the data nodes, how does Spark know which port the worker should be listening on? It makes sense that, even though the firewall is open, nothing can connect if nothing is listening.

Contributor

So, I decided to start back at square one and assign specific ports via the Spark settings (in the Ambari interface). Here are those custom settings: spark.blockManager.port = 40000, spark.broadcast.port = 40001, spark.driver.port = 40002, spark.executor.port = 40003, spark.fileserver.port = 40004, spark.replClassServer.port = 40005, spark.port.maxRetries = 5000. I added that last one because I read it gives Spark a range of ports to use. Now things are working. Will this somehow prevent me from running multiple applications at once? That is the feeling I get. What is the purpose of these settings?
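
For reference, in spark-defaults form (entered through the Spark config section in Ambari) those custom settings look like this:

spark.blockManager.port     40000
spark.broadcast.port        40001
spark.driver.port           40002
spark.executor.port         40003
spark.fileserver.port       40004
spark.replClassServer.port  40005
spark.port.maxRetries       5000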

Explorer (accepted solution)

I think you have done the right thing here, Charles. I had a setup where I set these ports to specific values as well, and then configured the max retries for those ports to control how many different applications can run in parallel. That configuration is there so that you don't have to keep as many ports open as in your current setup, I guess.

So, for example, if spark.driver.port is set to some value like 40000 and I specify spark.port.maxRetries as 32, it will retry across 32 ports starting from 40000 and, if a free port is available, it will bind to it. If not, the application will fail to start a SparkContext. But you should probably space each of the port settings 32 apart in this case, so that each one has its own range of 32 ports. So in my configuration it would be spark.driver.port: 40000, spark.blockManager.port: 40033, etc.
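
In config form, that layout is roughly:

spark.driver.port        40000
spark.blockManager.port  40033
spark.port.maxRetries    32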

spark.executor.port shouldn't be relevant here for the firewall on the driver if you are running with a cluster manager like YARN. If you are running Spark standalone, of course, then it could matter.

You can see the documentation for this here: https://spark.apache.org/docs/latest/configuration.html

Cheers

Contributor

@Subramaniam Ramasubramania - Thanks for the feedback. I'm glad this will work with multiple applications. I tried it with the Pi example, but since that example executes so fast, it was still grabbing port 40000 for both jobs. I'm relieved to know this will work long term. So for the port spacing, based on your recommendation, should I do the following: spark.blockManager.port = 40000, spark.broadcast.port = 40033, spark.driver.port = 40065, spark.executor.port = 40097, spark.fileserver.port = 40129, spark.replClassServer.port = 40161, spark.port.maxRetries = 5000? I know 5000 ports is a lot for the max retries, but I could probably bring that down to something like 250; I just wanted to be safe. Does this sound better than what I have?

Explorer

That seems more reasonable. But if you want to reduce spark.port.maxRetries to 250, then you should also use a spacing of 250. And I think there was a typo: 40000-40031 is 32 ports, so you can change the 40033 to 40032 if you are spacing by 32 ports.

And again, the executor ports will depend on what mode you are running Spark in (standalone vs cluster vs client).
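
To make that concrete, with a spacing of 250 the driver and block manager settings might end up looking something like this (illustrative values only; the other *.port settings you listed would follow the same pattern, 250 apart each, and the firewall range would widen accordingly):

spark.driver.port        40000
spark.blockManager.port  40250
spark.port.maxRetries    250

# firewall side: this Spark range can then stay open instead of the whole ephemeral range
sudo firewall-cmd --permanent --add-port=40000-40500/tcp
sudo firewall-cmd --reload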