Created on 03-24-2014 06:48 AM - edited 09-16-2022 01:55 AM
Hi,
I am a newbie to Apache Spark.
I have installed CDH 5 (Beta 2) using parcels, and installed Spark as well.
According to the Spark installation documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/c...), there is a note saying that running Spark on YARN is not supported.
So, if YARN in CDH 5 does not support Spark, how do we run Spark in CDH 5?
Please let me know, and also provide any documentation if available.
Thanks!
Created 03-24-2014 06:51 AM
At the moment, CDH5b2 deploys Spark in "standalone" mode: https://spark.apache.org/docs/0.9.0/spark-standalone.html
This simply means Spark tries to manage resources itself, rather than participating in a cluster manager like YARN or Mesos. As an end user, it shouldn't make much difference to you at all. Just fire up the shell and go.
Once a few kinks are worked out, Spark's YARN integration will be used in the future, as I understand.
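For a quick sanity check once the daemons are up, something like this should work (the master URL below is a placeholder, not your actual address; the Master web UI on port 8080 shows the real spark:// URL):

    $ MASTER=spark://your-master-host:7077 spark-shell
    scala> val rdd = sc.parallelize(1 to 10000)
    scala> rdd.filter(_ % 2 == 0).count()  // expect 5000

If that count comes back, the shell is talking to the standalone master and workers correctly.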
Created 03-25-2014 04:24 AM
Hi,
I have installed Spark in standalone mode on a CDH 5 cluster, but when I start the Spark Master I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/master/Master
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.master.Master
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.master.Master. Program will exit.
I have added the folder /usr/lib/spark to the classpath and also set SPARK_LIBRARY_PATH=/usr/lib/spark in spark-env.sh, but I still get this error.
I installed Spark using yum.
Could you please assist? Thanks!
Created 03-25-2014 04:29 AM
Are you on CDH5 beta 2? It already includes Spark. I wonder if its setup of Spark is interfering with whatever you have installed separately, or vice versa. Can you simply use the built-in deployment? It would be easier.
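If the CDH packages are in place, the bundled init scripts take care of the classpath for you. Assuming the standard CDH 5 service names (treat these as an assumption if your install differs), something like:

    $ sudo service spark-master start   # on the one node chosen as Master
    $ sudo service spark-worker start   # on each worker node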
Created 04-02-2014 07:22 AM
Hi,
I am unable to run even the sample verification job in Scala. The worker node shows as alive, with 4 cores (0 used) and 6.7 GB memory (0.0 B used).
But I am repeatedly getting the error below. Could you please assist?
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Created 04-02-2014 10:00 AM
I believe that means you've requested more memory for a task than any worker has available, but people more knowledgeable might be able to confirm or deny that.
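If that's the cause, one way to bound the request explicitly is via SparkConf when building the context. A minimal sketch, where the master URL, app name, and sizes are assumptions to adapt:

    import org.apache.spark.{SparkConf, SparkContext}

    // Ask for less memory per executor than the 6.7 GB the worker advertises,
    // so the standalone scheduler can actually place tasks.
    val conf = new SparkConf()
      .setMaster("spark://your-master-host:7077") // placeholder Master URL
      .setAppName("MemoryBoundedJob")             // placeholder app name
      .set("spark.executor.memory", "1g")
      .set("spark.cores.max", "2")
    val sc = new SparkContext(conf)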
Created 04-02-2014 09:38 PM
But I have not manually requested memory anywhere or set any parameter.
So, is there a way to control this? Thanks!
Created 04-07-2014 12:01 AM
Apache Spark uses the Derby database in the background, and hence only one instance of spark-shell can be connected at any time.
Is there any way to configure MySQL or any other RDBMS, and is there any configuration documentation?
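Assuming the Derby lock comes from the Hive metastore (used by Shark / Spark's Hive integration), the usual approach is to point the metastore at MySQL in hive-site.xml. A sketch, with placeholder host, database, user, and password:

    <!-- hive-site.xml: placeholder values, adapt host/db/user/password -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore-host/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive_password</value>
    </property>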
Created on 07-11-2014 05:00 AM - edited 07-11-2014 05:02 AM
My apologies for reawakening this dead thread, but the subject line is still applicable:
I am trying to get Spark 1.0 to run on CDH 5 YARN. That means NOT the provided 0.9 RPMs.
Still being quite new to Spark, I'm trying to reconcile what seems like somewhat contradictory information between Apache's Spark site and the CDH 5 installation guide. It begins with precisely *what* is installed on *which* node in the cluster. Apache simply states "on ALL nodes", without any distinction between node roles: namenode? datanode? resource manager? node manager? application master? Then there's the client, say a development machine or similar.
The Cloudera docs say to start the master on only one node; yet which machine is best chosen for this? Coming from MapReduce, this would suggest the namenode, but it's not made clear that this is the right choice. Should it instead be one of the datanodes (in this case they are also the node managers and application masters for YARN, in other words the worker bees of the cluster)? Or maybe the client machine from which jobs are launched through the shell or submit?
The Apache cluster mode overview isn't much help either. I suppose that with Spark still being fresh and new, the docs have to catch up from a typical case of "this was totally obvious to the creators" for us mere mortals.
What makes this more confusing is that the 1.0 tarball from Apache is one large directory, as opposed to the RPMs, which break the packages down. I assume I could simply install the tarball on each and every machine in the cluster, from client to namenode/resource manager to datanode/node manager, and just push out configuration settings to all, but that seems somewhat "sloppy".
In the end it would be great if things were clear enough to take the 1.0 source and create my own RPMs, aligned with the current 0.9 CDH 5 packages, and install them only where needed; but for that I need a better understanding of what goes where.
Any suggestions and pointers welcome!
Thanks!
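For what it's worth, one thing that may simplify the "what goes where" question: with Spark 1.0's spark-submit in YARN mode, the Spark assembly jar is shipped to the cluster as part of the job, so the tarball only strictly needs to live on the client/gateway machine. A sketch, assuming HADOOP_CONF_DIR points at your cluster configuration and the examples jar path matches your build:

    $ export HADOOP_CONF_DIR=/etc/hadoop/conf
    $ ./bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        lib/spark-examples-*.jar 10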