
Spark on YARN in CDH-5

Explorer

Hi,

 

I am a newbie to Apache Spark.

 

I have installed CDH-5 using parcels (Beta 2 version) and have also installed Spark.

 

 

According to the Spark installation documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/c...), it says:

 

Note:

  • The current version of CDH 5 does not support running Spark on YARN.
  • The current version of Spark does work in a secure cluster.

 

So, if YARN in CDH-5 does not support Spark, how do we run Spark in CDH-5?

 

Please let me know, and also provide any documentation if available.

 

Thanks!


25 REPLIES

Master Collaborator

At the moment, CDH5b2 deploys Spark in "standalone" mode: https://spark.apache.org/docs/0.9.0/spark-standalone.html

 

This simply means Spark tries to manage resources itself, rather than participating in a cluster manager like YARN or Mesos. As an end user, it shouldn't make much difference to you at all. Just fire up the shell and go.
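
For example, something along these lines should get you into the shell against the standalone master (a sketch only - the host name, port, and install path are placeholders I'm assuming, and CDH's defaults may differ):

# Spark 0.9 standalone: point the shell at the master and go
MASTER=spark://<master-host>:7077 /usr/lib/spark/bin/spark-shell

# the master's web UI (port 8080 by default) shows the registered workers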

 

As I understand it, Spark's YARN integration will be used in the future once a few kinks are worked out.

Explorer

Hi,

 

I have installed Spark in standalone mode on a CDH-5 cluster, but when I start the Spark Master I get the following error:

 

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/master/Master
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.master.Master
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.master.Master.  Program will exit.

 

I have added the folder '/usr/lib/spark' to the classpath and also set 'SPARK_LIBRARY_PATH=/usr/lib/spark' in the file spark-env.sh.
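
Concretely, this is all I have in spark-env.sh at the moment (the paths come from my yum install, so they may not match what CDH expects):

# spark-env.sh - my current settings
export SPARK_LIBRARY_PATH=/usr/lib/spark
# /usr/lib/spark is also the directory I added to the classpath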

 

I am still getting this error.

 

I installed Spark using yum.

 

Could you please assist? Thanks!

Master Collaborator

Are you on CDH5 beta 2? It already includes Spark. I wonder if its setup of Spark is interfering with whatever you have installed separately, or vice versa. Can you simply use the built-in deployment? It would be easier.
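
If you switch to the built-in deployment, starting it is roughly this (a sketch - the service names assume the CDH package install; with parcels you would start the Spark roles from Cloudera Manager instead):

sudo service spark-master start    # on the one machine chosen as master
sudo service spark-worker start    # on each worker node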

Explorer
I removed the version I installed and used the one available in CDH-5, and it worked. Thanks!

Explorer

Hi,

 

I am unable to run even the sample verification job in Scala. The worker node status is showing as alive with 4 cores (0 used) and 6.7 GB of memory (0.0 B used).

 

But I am repeatedly getting the below error. Could you please assist?

 

 

 

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

 

Master Collaborator

I believe that means you've requested more memory for a task than any worker has available, but people more knowledgeable might be able to confirm or deny that.
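
One thing worth trying is to explicitly ask for less executor memory than any single worker advertises (a sketch using the Spark 0.9 standalone properties - the exact value, host, and path are assumptions):

# request 512 MB executors, well under the 6.7 GB the worker reports as free
SPARK_JAVA_OPTS="-Dspark.executor.memory=512m" \
MASTER=spark://<master-host>:7077 \
/usr/lib/spark/bin/spark-shell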

Explorer

But I have not manually requested memory anywhere or set any parameter.

 

So, is there a way to control this? Thanks!

Explorer

Apache Spark uses the Derby database in the background, and hence only one instance of the 'spark-shell' can be connected at any time.

 

Is there any way to configure MySQL or any other RDBMS, and is there any configuration documentation?

Explorer

My apologies for reawakening this dead thread, but the subject line is still applicable:

 

I am trying to get Spark 1.0 to run on CDH5 YARN - that means NOT the provided 0.9 RPMs.

 

Still very new to Spark, I'm trying to reconcile what seems like somewhat contradictory information between Apache's Spark site and the CDH5 installation guide. It starts with precisely *what* gets installed on *which* node in the cluster. Apache simply says "on ALL nodes", without any distinction between node roles - namenode? datanode? resource manager? node manager? application master? Then there's the client, say a development machine or similar.

 

The Cloudera docs say to start the master on only one node - yet which machine is best chosen for this? Coming from MapReduce this would imply the namenode, but it's not made clear that that is the right choice. Should it instead be one of the datanodes (which in this case are also the node managers and application masters for YARN - in other words, the worker bees of the cluster)? Or maybe it should be the client machine from which jobs are launched through the shell or spark-submit?

 

The Apache cluster mode overview isn't much help either. I suppose that with Spark still being fresh and new, the documentation has yet to catch up from a typical case of "this was totally obvious to the creators" for us mere mortals.

 

What makes this more confusing is that the 1.0 tarball from Apache is one large directory, as opposed to the RPMs, which break the packages down. I assume I could simply install the tarball on each and every machine in the cluster, from client to namenode/resource manager to datanode/node manager, and just push the configuration settings out to all of them, but that seems somewhat "sloppy".
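
For what it's worth, the kind of invocation I am hoping to end up with looks roughly like this (pieced together from the Apache docs, so the example class, jar name, and option values are my assumptions):

# submit from a single client machine; in yarn-cluster mode YARN ships the
# Spark assembly to the node managers, so (as I understand it) only the
# client needs the 1.0 install
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --executor-memory 1g \
  lib/spark-examples-*.jar 10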

 

In the end it would be great if things were clear enough to take the 1.0 source and create my own RPMs, aligned with the current 0.9 CDH5 packages, and install them only where needed - but for that I need a better understanding of what goes where.

 

Any suggestions and pointers welcome!

 

Thanks!