Spark on YARN in CDH-5

Explorer

Hi,

 

I am a newbie to Apache Spark.

 

I have installed CDH-5 using parcels (Beta 2 version) and have also installed Spark.

As per the Spark installation documentation, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/c..., it says:

 

Note:

  • The current version of CDH 5 does not support running Spark on YARN.
  • The current version of Spark does work in a secure cluster.

 

So, if YARN in CDH-5 does not support Spark, how do we run Spark in CDH-5?

 

Please let me know, and also provide any documentation if available.

 

Thanks!

2 ACCEPTED SOLUTIONS

Master Collaborator

At the moment, CDH5b2 deploys Spark in "standalone" mode: https://spark.apache.org/docs/0.9.0/spark-standalone.html

 

This simply means Spark tries to manage resources itself, rather than participating in a cluster manager like YARN or Mesos. As an end user, it shouldn't make much difference to you at all. Just fire up the shell and go.
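For example, something along these lines should get you into the shell against the standalone master (a rough sketch: "master-host" is a placeholder for whichever host runs the Spark Master role, 7077 is the default standalone port, and in Spark 0.9 the shell picks up the master URL from the MASTER environment variable):

    # point the shell at the standalone Master; adjust the hostname for your cluster
    MASTER=spark://master-host:7077 spark-shell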

 

As I understand it, Spark's YARN integration will be used in the future, once a few kinks are worked out.


Master Collaborator

Are you on CDH5 beta 2? It already includes Spark. I wonder if its setup of Spark is interfering with whatever you have installed separately, or vice versa. Can you simply use the built-in deployment? It would be easier.


25 REPLIES

Explorer

Thanks for the help,

 

I followed the instructions and got this error:

Error: Cannot load main class from JAR: file:/var/lib/hadoop-hdfs/class

 

Can you give any advice?

 

Thanks!

Master Collaborator

That sounds like a bad command line. I don't see that path in the instructions either. Check that you are following the instructions for 5.2 in the previous link.
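For reference, a SparkPi submission usually looks something like the sketch below (the examples jar path is a placeholder for wherever the jar lives in your install); the "Cannot load main class from JAR" message generally means spark-submit treated the wrong argument as the application jar:

    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        /path/to/spark-examples.jar 10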

Explorer

Thanks for your reply, sowen.

 

I just tried another link, https://spark.apache.org/docs/1.1.0/running-on-yarn.html, and it works.

I got the result:

14:52:41 INFO Client: Application report from ResourceManager: 
           application identifier: application_1416365742014_0003
           appId: 3
           clientToAMToken: null
           appDiagnostics: 
           appMasterHost: 01slave.mabu.com
           appQueue: root.root
           appMasterRpcPort: 0
           appStartTime: 1416383498088
           yarnAppState: FINISHED
           distributedFinalState: SUCCEEDED
           appTrackingUrl: http://00master.mabu.com:8088/proxy/application_1416365742014_0003/history/spark-pi-1416383528301
           appUser: root

 

The problem is I can't find the result of Pi, like when we run the Pi example on Hadoop (it prints the result 3.14333...). Where can I find it?

 

Thanks!

 

Master Collaborator

Yes, in that example you are clearly running on YARN. So you see it in the history, right?

 

It looks like the example uses yarn-cluster mode, which means the driver was launched on YARN, not locally. The output will be on the YARN container that had the driver.
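If log aggregation is enabled on your cluster (an assumption; check your YARN configuration), you can pull the driver's logs after the application finishes with the YARN CLI, for example:

    # application ID taken from the report you pasted above
    yarn logs -applicationId application_1416365742014_0003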

 

Try yarn-client instead to make your local process the driver, and it should print the result on your console.
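A minimal sketch of that, with the same placeholder jar path as before:

    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-client \
        /path/to/spark-examples.jar 10

In yarn-client mode the driver runs inside your local spark-submit process, so the "Pi is roughly 3.14..." line should show up directly in your console output.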

Explorer

Thanks again, owen.

 

The example went well and I can see the Pi result now, but I still got some errors:

WARN YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(03slave.mabu.com,42930) not found

WARN ConnectionManager: All connections not cleaned up.

 

I don't know if it's because of a poor connection or the amount of RAM on my cluster, but this is still a good start for me anyway.

By the way, do you know where I can find more information about the Spark system (how it works, its operation, when to use yarn-cluster/yarn-client, ...)?

 

Thanks a lot!

Master Collaborator

It looks like you asked for more resources than you configured YARN to offer, so check how much you can allocate in YARN and how much Spark asked for. I don't know about the ERROR; it may be a red herring. Please have a look at http://spark.apache.org/docs/latest/ for pretty good Spark docs.
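If it helps, these are the usual things to compare (the property names are standard YARN/Spark settings; the values and jar path below are only placeholders):

    # yarn-site.xml: what YARN can hand out per NodeManager and per container
    yarn.nodemanager.resource.memory-mb      (e.g. 8192)
    yarn.scheduler.maximum-allocation-mb     (e.g. 8192)

    # what Spark asked for on submission
    spark-submit --master yarn-client \
        --num-executors 2 \
        --executor-memory 1g \
        --executor-cores 1 \
        --class org.apache.spark.examples.SparkPi \
        /path/to/spark-examples.jar 10

The "Initial job has not accepted any resources" warning usually goes away once the executors you request fit within what YARN is allowed to allocate.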