
Unable to execute jobs on spark slave node.
Hi All,

I am deploying PredictionIO on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.

Following are the logs after starting slaves.sh:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT

18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu

18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu

18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:

18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:

18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(ubuntu); groups with modify permissions: Set()

18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.

18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM

18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1

18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6

18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.

18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081

18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...

Now the issues:

  1. If I launch one slave on the master node and one slave on my other node:
    1.1 If the slave on the master node is given fewer resources, it fails with an "unable to re-shuffle" error.
    1.2 If I give more resources to the worker on the master node, all execution happens on the master node; nothing is sent to the slave node.
  2. If I do not start a slave on the master node, I get the following error:
    [WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have assigned 24 GB RAM and 8 cores to the worker.
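For reference, this is a minimal sketch of how I understand these limits are declared and consumed in a standalone deployment: the worker's capacity is set in conf/spark-env.sh, and the driver then requests executors that must fit inside that capacity (pio train forwards everything after "--" to spark-submit; the master URL below is the one from the registration log, and the executor-memory/executor-cores values are illustrative, not my exact settings):

```shell
# conf/spark-env.sh on the worker node: capacity advertised to the master
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=24g

# Driver side: request executors that fit within the worker's limits.
# pio train passes everything after "--" straight to spark-submit.
pio train -- \
  --master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077 \
  --executor-memory 4g \
  --executor-cores 2
```

My understanding is that the "Initial job has not accepted any resources" warning appears when no registered worker can satisfy the requested executor memory/cores, so the values above are where I have been looking.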

However, when I start the process, these are the logs I get on the slave machine:

18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)

18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077

18/05/22 06:07:27 INFO Worker: Asked to launch executor app-20180522060727-0000/0 for PredictionIO Training: com.actionml.RecommendationEngine

18/05/22 06:07:27 INFO SecurityManager: Changing view acls to: ubuntu

18/05/22 06:07:27 INFO SecurityManager: Changing modify acls to: ubuntu

18/05/22 06:07:27 INFO SecurityManager: Changing view acls groups to:

18/05/22 06:07:27 INFO SecurityManager: Changing modify acls groups to:

18/05/22 06:07:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(ubuntu); groups with modify permissions: Set()

18/05/22 06:07:27 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=34031" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.31.5.119:34031" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522060727-0000" "--worker-url" "spark://Worker@172.31.6.235:45057"

18/05/22 06:08:02 INFO Worker: Asked to kill executor app-20180522060727-0000/0

18/05/22 06:08:02 INFO ExecutorRunner: Runner thread for executor app-20180522060727-0000/0 interrupted

18/05/22 06:08:02 INFO ExecutorRunner: Killing process!

18/05/22 06:08:02 INFO Worker: Executor app-20180522060727-0000/0 finished with state KILLED exitStatus 143

18/05/22 06:08:02 INFO Worker: Cleaning up local directories for application app-20180522060727-0000

18/05/22 06:08:02 INFO ExternalShuffleBlockResolver: Application app-20180522060727-0000 removed, cleanupLocalDirs = true

Can someone please help me figure out the issue here?
Thanks