
Spark continuously running with exit code 1


Hi,

I'm new to Apache Spark, so I'm not sure this is the best setup; my goal is to create an environment where I can test and evaluate before making a decision. I set up a cluster on Windows using the steps from:

https://aamargajbhiye.medium.com/apache-spark-setup-a-multi-node-standalone-cluster-on-windows-63d41...

The cluster version I'm using is the latest: Spark 3.3.1 built for Hadoop 3.

The master node starts without an issue, and I'm able to register the workers on each worker node using the following command:

spark-class org.apache.spark.deploy.worker.Worker spark://<Master-IP>:7077 --host <Worker-IP>
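
For completeness, the master was started per the same guide with the equivalent master class, along these lines (the exact flags are from the guide, so this is approximate):

spark-class org.apache.spark.deploy.master.Master --host <Master-IP> --port 7077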

When I register the worker, it connects and registers successfully, as the log message indicates, and I can see both workers in the master UI with ALIVE status.

Then I tried submitting a simple hello_world.py job using:

spark-submit --master spark://<Master-IP>:7077 hello_world.py

My hello_world.py application looks like this:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and run a trivial job
spark = SparkSession.builder.appName("Hello World").getOrCreate()
print("Hello From Spark!")
rdd = spark.sparkContext.parallelize([1, 2, 3])  # distribute a small list as an RDD
print(rdd.collect())  # bring the elements back to the driver
spark.stop()

What happens when I submit the job is that Spark continuously tries to create new executors, as if it's retrying, but they all exit with code 1, and I have to kill the job to stop it.

When I check the UI and click on a given executor, I see the following in its stdout and stderr:

stdout:

22/12/12 08:04:11 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 6544@HOU12-FSRM01
22/12/12 08:04:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/12 08:04:11 INFO SecurityManager: Changing view acls to: vnetadmin
22/12/12 08:04:11 INFO SecurityManager: Changing modify acls to: vnetadmin
22/12/12 08:04:11 INFO SecurityManager: Changing view acls groups to: 
22/12/12 08:04:11 INFO SecurityManager: Changing modify acls groups to: 
22/12/12 08:04:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(vnetadmin); groups with view permissions: Set(); users  with modify permissions: Set(vnetadmin); groups with modify permissions: Set()

stderr:

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	....
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
Caused by: java.io.IOException: Failed to connect to <Master DNS>/<Master IP>:56526
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
	.....
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: <Master DNS>/<Master IP>:56526
Caused by: java.net.ConnectException: Connection refused: no further information
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
	....

 

Not sure how to fix the error above. I tried opening the port referenced in the error ("Failed to connect to <Master DNS>/<Master IP>:56526") on the master node, but each time the failure references a different port.

Not sure what else I can do or how to troubleshoot.
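
From what I can tell, the changing number is the driver's ephemeral RPC port, which the executors have to connect back to. One thing I was planning to try is pinning it to fixed values so those ports can be opened in the Windows firewall; spark.driver.host, spark.driver.port, and spark.blockManager.port are standard Spark settings, and the port numbers below are just examples:

spark-submit --master spark://<Master-IP>:7077 --conf spark.driver.host=<Driver-IP> --conf spark.driver.port=51000 --conf spark.blockManager.port=51100 hello_world.py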

Any help is appreciated.

1 REPLY

Super Collaborator

Hi @SAMSAL 

I think you want to run the Spark application in standalone mode. Please follow these steps:

1. Install Apache Spark.

2. Start the standalone master and workers. By default, the master starts on port 7077 (its web UI is on port 8080). Open the standalone master UI and check that all workers are running as expected.

3. Once everything is running as expected, submit the Spark application against the standalone master host and port 7077, as in the example below.
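
For example, on Windows (where the sbin start scripts are not available) you can launch the master and workers directly with spark-class and then submit against the master URL; <Master-IP> and <Worker-IP> are placeholders for your hosts:

spark-class org.apache.spark.deploy.master.Master --host <Master-IP> --port 7077
spark-class org.apache.spark.deploy.worker.Worker spark://<Master-IP>:7077 --host <Worker-IP>
spark-submit --master spark://<Master-IP>:7077 hello_world.py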