Support Questions

geko · 04-13-2014

Hi,

I'm going to start working with Spark and installed the parcels in our CDH5 GA cluster.

Master: hadoop-pg-5.cluster, Worker: hadoop-pg-7.cluster

Both daemons are running, Master-Web-UI shows the connected worker, and the log entries show:

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting Spark master at spark://hadoop-pg-5.cluster:7077
...

2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: Started Master web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been elected leader! New state: ALIVE

2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM

worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting Spark worker hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...

2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: Started Worker web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting to master spark://hadoop-pg-5.cluster:7077...
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: Successfully registered with master spark://hadoop-pg-5.cluster:7077

Looks good, so far.

Now I want to run the Python pi example by executing the following on the worker:

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077

Here the strange thing happens: the script doesn't get executed. It hangs, repeating this output forever:

14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

The whole log is:

14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140413213046-0000
14/04/13 21:30:48 INFO SparkContext: Starting job: reduce at ./python/examples/pi.py:36
14/04/13 21:30:48 INFO DAGScheduler: Got job 0 (reduce at ./python/examples/pi.py:36) with 2 output partitions (allowLocal=false)
14/04/13 21:30:48 INFO DAGScheduler: Final stage: Stage 0 (reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO DAGScheduler: Parents of final stage: List()
14/04/13 21:30:48 INFO DAGScheduler: Missing parents: List()
14/04/13 21:30:48 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36), which has no missing parents
14/04/13 21:30:48 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

So I have to cancel the execution of the script. When I do, I get the following log entries on the master (note: these appear at cancellation of the Python pi script):

2014-04-13 21:30:46,965 INFO org.apache.spark.deploy.master.Master: Registering app PythonPi
2014-04-13 21:30:46,974 INFO org.apache.spark.deploy.master.Master: Registered app PythonPi with ID app-20140413213046-0000
2014-04-13 21:31:27,123 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,125 INFO org.apache.spark.deploy.master.Master: Removing app app-20140413213046-0000
2014-04-13 21:31:27,143 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,144 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.147.210.7%3A44207-2#-389971336] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-04-13 21:31:27,194 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,199 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,215 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,222 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,234 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,238 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.

What is going wrong here ?!?!?!?

I get the same behaviour if I start the spark-shell on the worker and try to execute e.g. sc.parallelize(1 to 100,10).count

any help highly appreciated, Gerd

srowen · 04-14-2014

I would bet that this means that the amount of memory you have requested for your executors exceeds the amount of memory that any single worker of yours has. What are these sizes?
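
A minimal sketch of the check described above, not Spark's actual scheduler code: an executor can only be placed on a worker that offers at least the requested executor memory. The worker size (64.0 MB) is taken from the worker log in the question; the 512 MB executor request is an assumed value for illustration, since the thread does not state what was requested. The helper names `parse_mb` and `workers_that_fit` are hypothetical.

```python
import re

def parse_mb(text):
    """Convert strings such as '64.0 MB', '512m' or '2g' to megabytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([mMgG])[bB]?\s*", text)
    if match is None:
        raise ValueError("unrecognised memory size: %r" % text)
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * 1024 if unit == "g" else value

def workers_that_fit(worker_memory, executor_memory):
    """Return the workers whose advertised memory covers one executor."""
    needed = parse_mb(executor_memory)
    return [name for name, mem in worker_memory.items()
            if parse_mb(mem) >= needed]

# Worker size as reported in the worker log of the question.
workers = {"hadoop-pg-7.cluster:7078": "64.0 MB"}

# Assumed executor request of 512 MB (hypothetical value).
print(workers_that_fit(workers, "512m"))
# A request that does not exceed the worker's memory, for comparison.
print(workers_that_fit(workers, "64m"))
```

If the first list is empty, no worker can host an executor, which would match the repeated "Initial job has not accepted any resources" warning.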

geko · 04-14-2014

Hi Sean,

Thanks for your hint, increasing the worker memory settings solved the problem.

I set worker_max_heapsize to its default value of 512MB (it was just 64MB before) and the total executor memory size to 2GB.

thanks, Gerd
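
To confirm the memory a worker advertises after such a change, the registration line in the master log can be read programmatically. This is a minimal sketch, assuming the log line format shown in the first post; the helper name `registered_workers` is hypothetical, and the handling of a "GB" unit is an assumption, since only "MB" appears in this thread.

```python
import re

# Pattern for the master's registration message, e.g.
# "... Master: Registering worker host:7078 with 2 cores, 64.0 MB RAM"
REGISTRATION = re.compile(
    r"Registering worker (?P<worker>\S+) with (?P<cores>\d+) cores, "
    r"(?P<memory>[\d.]+) (?P<unit>MB|GB) RAM"
)

def registered_workers(log_lines):
    """Yield (worker, cores, memory_in_mb) for each registration line."""
    for line in log_lines:
        match = REGISTRATION.search(line)
        if match:
            memory = float(match.group("memory"))
            if match.group("unit") == "GB":  # assumed unit, not seen in the thread
                memory *= 1024
            yield match.group("worker"), int(match.group("cores")), memory

# Registration line copied from the master log in the first post.
sample = [
    "2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: "
    "Registering worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM",
]
for worker, cores, memory_mb in registered_workers(sample):
    print(worker, cores, memory_mb)
```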

Harihar · 10-15-2014

Hi, I'm running CDH 5.1.3 libraries on a CDH5 cluster, but when I run my Spark program it gives me this exception:

2014-10-16 14:37:38,312 INFO org.apache.spark.deploy.worker.Worker: Asked to launch executor app-20141016143738-0008/1 for SparkROnCluster
2014-10-16 14:37:38,317 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/run/cloudera-scm-agent/process/256-spark-SPARK_WORKER/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:759)
at org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:72)
at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
at org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:109)
at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:124)
at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:58)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 6 more

When I run sample examples like wordcount or tallSVd, they run fine. What changes should I make so that my application can run this script file?
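
`error=2` in the exception is the operating system's ENOENT ("No such file or directory"), so one possible first diagnostic step is to check on the worker host whether the script named in the stack trace exists and is executable. A minimal sketch; the helper `describe_script` is hypothetical, and the path is copied from the exception above.

```python
import errno
import os

def describe_script(path):
    """Report whether a script is missing, not executable, or usable."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"

# errno 2 is ENOENT, matching "error=2" in the Java exception.
print(errno.ENOENT, os.strerror(errno.ENOENT))

# Path taken from the stack trace; run this on the worker host.
script = ("/run/cloudera-scm-agent/process/256-spark-SPARK_WORKER"
          "/bin/compute-classpath.sh")
print(describe_script(script))
```

This only narrows down whether the file is absent or lacks execute permission; it does not explain why the worker's process directory would not contain the script.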