Hi,
I'm new to Apache Spark, so I'm not sure this is the best setup. My goal is to create an environment where I can test and evaluate before making a decision. I set up a cluster on Windows using the steps from:
https://aamargajbhiye.medium.com/apache-spark-setup-a-multi-node-standalone-cluster-on-windows-63d41...
The version I'm using is the latest: Spark 3.3.1 (pre-built for Hadoop 3).
The master node starts without an issue, and I'm able to register the workers on each worker node using the following command:
spark-class org.apache.spark.deploy.worker.Worker spark://<Master-IP>:7077 --host <Worker-IP>
When I register a worker, it connects and registers successfully, as the log message indicates, and I can see both workers in the UI with the ALIVE status.
Then I tried submitting a simple hello_world.py job using:
spark-submit --master spark://<Master-IP>:7077 hello_world.py
My hello_world.py application is like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Hello World").getOrCreate()
print("Hello From Spark!")
sparkContext = spark.sparkContext
rdd = sparkContext.parallelize([1, 2, 3])
print(rdd.collect())
When I submit the job, Spark continuously tries to create new executors, as if it is retrying, but they all exit with code 1, and I have to kill the job to stop it.
When I open the UI and click on a given executor, I see the following in stdout and stderr:
stdout:
22/12/12 08:04:11 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 6544@HOU12-FSRM01
22/12/12 08:04:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/12 08:04:11 INFO SecurityManager: Changing view acls to: vnetadmin
22/12/12 08:04:11 INFO SecurityManager: Changing modify acls to: vnetadmin
22/12/12 08:04:11 INFO SecurityManager: Changing view acls groups to:
22/12/12 08:04:11 INFO SecurityManager: Changing modify acls groups to:
22/12/12 08:04:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vnetadmin); groups with view permissions: Set(); users with modify permissions: Set(vnetadmin); groups with modify permissions: Set()
stderr:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
....
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
Caused by: java.io.IOException: Failed to connect to <Master DNS>/<Master IP>:56526
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
.....
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: <Master DNS>/<Master IP>:56526
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
....
I'm not sure how to fix the error above. I tried opening the port referenced in "Failed to connect to <Master DNS>/<Master IP>:56526" on the master node, but the message shows a different port every time.
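One thing I have not ruled out: if I understand correctly, the driver picks a random RPC port on each run unless spark.driver.port is set, which might explain the changing number. Below is a rough sketch of how the submit command could be built with fixed ports so that they can be opened in the firewall. The helper name build_submit_command and the port numbers 40000 and 40001 are placeholders I made up. The --conf keys are standard Spark settings, but I'm only assuming they are the relevant ones here.

```python
# Hypothetical helper: builds a spark-submit command with fixed driver ports.
# The port values are arbitrary placeholders, not recommended values.
def build_submit_command(master_url, script, driver_host,
                         driver_port=40000, block_manager_port=40001):
    conf = {
        "spark.driver.host": driver_host,            # address executors connect back to
        "spark.driver.port": str(driver_port),       # random by default
        "spark.blockManager.port": str(block_manager_port),  # random by default
    }
    cmd = ["spark-submit", "--master", master_url]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


# Print the command line to run from the machine that submits the job.
print(" ".join(build_submit_command(
    "spark://<Master-IP>:7077", "hello_world.py", "<Master-IP>")))
```

The two fixed ports would then need to be reachable from every worker node.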
I'm not sure what else I can do or how to troubleshoot.
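To narrow it down, I could check from a worker node whether a given port on the master is reachable at all. Here is a minimal sketch using only the Python standard library. The function name can_connect is my own, and the host and port in the commented usage line are placeholders.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example usage on a worker node (placeholders, fill in real values):
# print(can_connect("<Master-IP>", 7077))
```

A False result for the port shown in the executor's stderr, while the job is still running, would suggest a firewall or binding problem rather than a problem in the job itself.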
Any help is appreciated.