Unable to execute jobs on spark slave node.

Hi All,

I am deploying PredictionIO on a multi-node cluster where training should happen on the worker node. The worker node has registered successfully with the master.

The following are the logs after starting slaves.sh:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP

18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT

18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu

18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu

18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:

18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:

18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(ubuntu); groups with modify permissions: Set()

18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.

18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM

18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1

18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6

18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.

18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081

18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...

Now the issues:

  1. If I launch one slave on the master node and one slave on my other node:
    1.1 If the slave on the master node is given fewer resources, I get some kind of "unable to re-shuffle" error.
    1.2 If I give more resources to the worker on the master node, all the execution happens on the master node; nothing is sent to the slave node.
  2. If I do not start a slave on the master node:
    2.1 I get the following message:
    [WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have assigned 24 GB of RAM and 8 cores to the worker.
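
As a rough sanity check on these numbers, here is a minimal Python sketch that compares what the worker advertises (8 cores, 24.0 GB RAM, from the worker startup log) with what a single executor asks for (`-Xmx4096M` and `--cores 8`, taken from the launch command in the slave log further down). The `Resources` class and the `fits` helper are hypothetical names used only for illustration, and the check ignores any memory overhead beyond the heap size. It only suggests that the request is within the worker's advertised limits.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cores: int
    memory_mb: int


def fits(worker: Resources, executor: Resources) -> bool:
    """Return True if the executor request is within the worker's advertised limits."""
    return executor.cores <= worker.cores and executor.memory_mb <= worker.memory_mb


# Values copied from the logs in this post.
worker = Resources(cores=8, memory_mb=24 * 1024)  # "with 8 cores, 24.0 GB RAM"
executor = Resources(cores=8, memory_mb=4096)     # "-Xmx4096M" ... "--cores" "8"

print(fits(worker, executor))
```

Note that with these values one executor takes all 8 cores of the worker, so only a single such executor can run there at a time.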

When I start the process, these are the logs I get on the slave machine:

18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)

18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077

18/05/22 06:07:27 INFO Worker: Asked to launch executor app-20180522060727-0000/0 for PredictionIO Training: com.actionml.RecommendationEngine

18/05/22 06:07:27 INFO SecurityManager: Changing view acls to: ubuntu

18/05/22 06:07:27 INFO SecurityManager: Changing modify acls to: ubuntu

18/05/22 06:07:27 INFO SecurityManager: Changing view acls groups to:

18/05/22 06:07:27 INFO SecurityManager: Changing modify acls groups to:

18/05/22 06:07:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(ubuntu); groups with modify permissions: Set()

18/05/22 06:07:27 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=34031" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.31.5.119:34031" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522060727-0000" "--worker-url" "spark://Worker@172.31.6.235:45057"

18/05/22 06:08:02 INFO Worker: Asked to kill executor app-20180522060727-0000/0

18/05/22 06:08:02 INFO ExecutorRunner: Runner thread for executor app-20180522060727-0000/0 interrupted

18/05/22 06:08:02 INFO ExecutorRunner: Killing process!

18/05/22 06:08:02 INFO Worker: Executor app-20180522060727-0000/0 finished with state KILLED exitStatus 143

18/05/22 06:08:02 INFO Worker: Cleaning up local directories for application app-20180522060727-0000

18/05/22 06:08:02 INFO ExternalShuffleBlockResolver: Application app-20180522060727-0000 removed, cleanupLocalDirs = true
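
To summarize the slave log above, here is a small Python sketch that works out how long the executor lived and what its exit status means. The helper names `event_time` and `exit_signal` are my own and not part of Spark; the log lines are copied from this post. It relies on the usual shell convention that an exit status above 128 means the process was terminated by signal number (status - 128), so 143 would correspond to SIGTERM. That would be consistent with the worker being asked to kill the executor rather than the executor crashing on its own.

```python
import re
import signal
from datetime import datetime

# Three lines copied verbatim from the slave log in this post.
LOG = """\
18/05/22 06:07:27 INFO Worker: Asked to launch executor app-20180522060727-0000/0 for PredictionIO Training: com.actionml.RecommendationEngine
18/05/22 06:08:02 INFO Worker: Asked to kill executor app-20180522060727-0000/0
18/05/22 06:08:02 INFO Worker: Executor app-20180522060727-0000/0 finished with state KILLED exitStatus 143
"""

TS_FORMAT = "%y/%m/%d %H:%M:%S"  # timestamp layout used in the log lines above


def event_time(log: str, phrase: str) -> datetime:
    """Return the timestamp of the first log line containing `phrase`."""
    for line in log.splitlines():
        if phrase in line:
            return datetime.strptime(line[:17], TS_FORMAT)
    raise ValueError(f"no log line contains {phrase!r}")


def exit_signal(exit_status: int):
    """Map an exit status above 128 to a signal name, else return None."""
    if exit_status > 128:
        return signal.Signals(exit_status - 128).name
    return None


launched = event_time(LOG, "Asked to launch executor")
killed = event_time(LOG, "Asked to kill executor")
lifetime = killed - launched

status = int(re.search(r"exitStatus (\d+)", LOG).group(1))

print(f"executor lifetime: {lifetime.total_seconds():.0f} s")
print(f"exit status {status} -> {exit_signal(status)}")
```

So the executor was launched on the slave, ran for about half a minute, and was then killed on request; the slave log by itself does not show why the kill was requested.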

Can someone please help me figure out the issue here?
Thanks
