
Spark on YARN in CDH-5

Explorer

Hi,

 

I am a newbie to Apache Spark.

 

I have installed CDH-5 using parcels (Beta 2 version) and have also installed Spark.

 

 

As per the Spark installation documentation, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/c..., it says:

 

Note:

  • The current version of CDH 5 does not support running Spark on YARN.
  • The current version of Spark does not work in a secure cluster.

 

So, if YARN in CDH-5 does not support Spark, how do we run Spark in CDH-5?

 

Please let me know, and also provide any documentation if available.

 

Thanks!

2 ACCEPTED SOLUTIONS

Master Collaborator

At the moment, CDH5b2 deploys Spark in "standalone" mode: https://spark.apache.org/docs/0.9.0/spark-standalone.html

 

This simply means Spark tries to manage resources itself, rather than participating in a cluster manager like YARN or Mesos. As an end user, it shouldn't make much difference to you at all. Just fire up the shell and go.
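For example, a minimal sketch of firing up the 0.9 shell against a standalone master; the hostname below is a hypothetical placeholder, and 7077 is the standalone default port:

# Spark 0.9's shell reads the master URL from the MASTER env var.
# "master-host" is hypothetical -- substitute your own Spark master.
export MASTER=spark://master-host:7077
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-shell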

 

As I understand it, Spark's YARN integration will be used in the future, once a few kinks are worked out.


Master Collaborator

Are you on CDH5 beta 2? It already includes Spark. I wonder if its setup of Spark is interfering with whatever you have installed separately, or vice versa. Can you simply use the built-in deployment? It would be easier.


25 REPLIES

Master Collaborator

You might wait for CDH 5.1.0, which will be released very soon. This deploys Spark 1.0.0+patches on YARN for you.

 

"Node" means a machine on which you want to run Spark. "Namenode" is for example an HDFS concept. It is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not describe in terms of, say, HDFS roles.

 

You do not need to start the Spark master on the HDFS namenode. You didn't have to start the MR jobtracker on the namenode either. On a cluster I ran, I put the master on the namenode just since it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services.

 

You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager.


The Spark master is not the same thing as a client. Its role is like that of the jobtracker really. It would not be run outside the cluster. You may be thinking of a driver for your specific app.

 

The Apache distro is indeed a tarball and it's up to you to deploy it and run it. The role of CDH is to package, deploy and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes.

 

If you want to get adventurous, you can go to all machines and dig into /opt/cloudera/parcels/CDH/lib/spark and replace binaries with a newer compiled version. That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.
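Very roughly, such a swap might look like the sketch below. This is unsupported and purely illustrative; the tarball name and location are assumptions:

# Repeat on every node, and keep a backup of the shipped jars first.
cd /opt/cloudera/parcels/CDH/lib/spark
cp -a lib lib.bak
# Unpack whatever newer build you compiled or downloaded (path is an assumption):
tar -xzf /tmp/spark-1.0.0-bin-hadoop2.tgz -C /tmp
cp /tmp/spark-1.0.0-bin-hadoop2/lib/*.jar lib/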

Explorer

@srowen wrote:

You might wait for CDH 5.1.0, which will be released very soon. This deploys Spark 1.0.0+patches on YARN for you.

 

[...]


Ah, thanks so much for that info! This answered a lot of my questions in one fell swoop.

 

I am of course aware that namenode/datanodes and RM/NM are not synonymous; however, for simplicity's sake I put them together, as they frequently are.

 

My assumption was that the master was equivalent to a jobtracker, as you said, and would therefore frequently be found on a NN/RM node, whereas the workers would go on a DN/NM/AM. Again, lumping those Hadoop components together.

 

If not, how would the workers access files on HDFS, unless by streaming? And what would that do to performance? It's unfortunate that the Apache docs don't give a really detailed view of the architecture and component interaction, both within Spark and with the various Hadoop components.

 

I think it's this that caused my confusion: if we're talking about Spark operating in a YARN environment, then there's a tacit implication of also having a "typical" underlying infrastructure based on your usual Hadoop cluster. Even if we take YARN out of the equation, if your data is on HDFS, then where do the Spark workers need to sit to ensure the maximum access speed? Talk of RDDs is all well and good, but at some point your data is not all in memory; it's on platters, whence it must get INTO memory! 🙂

 

When's 5.1 coming? 😉

Master Collaborator

This might help a lot: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 

Yes, you want Spark executors to end up colocated with datanodes, or else data has to be accessed over the network a lot. It works, but ideally workers all process only local data. When using YARN, you should get that if the YARN nodemanagers are colocated with the datanodes, since YARN is the thing running Spark's executors in its containers.

 

Things get confusing because of at least two things. First, there are two different types of YARN deployment, although I don't think they affect how you think about placing services. But second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, wherein you actually do control where Spark workers run, separately from YARN.
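For concreteness, the two YARN deployment types correspond to different --master settings at submit time. A sketch using Spark 1.x's spark-submit; the application class and jar names are hypothetical placeholders:

# Driver runs on the submitting machine; executors run in YARN containers:
spark-submit --master yarn-client --class com.example.MyApp myapp.jar
# Driver itself also runs inside the YARN ApplicationMaster:
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar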

 

I suppose I'd say the thing that matters is: datanode, nodemanager, and Spark worker services are present on all machines doing work.

Explorer

@srowen wrote:

This might help a lot: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 

[...]

 

Things get confusing because of at least two things. First, there are two different types of YARN deployment, although I don't think they affect how you think about placing services. But second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, wherein you actually do control where Spark workers run, separately from YARN.

 

I suppose I'd say the thing that matters is: datanode, nodemanager, and Spark worker services are present on all machines doing work.


 Many thanks for the blog link! I need to spend a week just catching up on the Cloudera blog one of these days. You guys produce some fantastic articles in there.

 

Yes, it was the mild confusion about 0.9 using this "standalone" mode instead of YARN which made me try and go for 1.0. Something I read, either on the Apache Spark pages or some other link via Google, gave me the impression that the 0.9 RPMs are hard-wired for standalone mode and can't be configured for YARN instead. It could also have been a comment/note box in the CDH5 install guide. Anyway, I'm still digging through the config options to see whether 0.9 can be configured for YARN instead of standalone.

 

May get around to it this weekend, unless I crash and fall asleep. Dang World Cup! 😉

Contributor

Hello!!

 

I have a similar issue. I have CDH 5 installed on my cluster (version Hadoop 2.3.0-cdh5.1.3).


I have installed and configured a prebuilt version of Spark 1.1.0 (Apache version), built for Hadoop 2.3, on my cluster.

When I run the Pi example in 'client' mode, it runs successfully, but it fails in 'yarn-cluster' mode. The Spark job is successfully submitted, but fails after some time, saying:

 

***********************************
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2 --driver-memory 500m --executor-cores 2 lib/spark-examples*.jar 3

Logs:
14/11/05 20:47:47 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1415193640322_0013
appId: 13
clientToAMToken: null
appDiagnostics: Application application_1415193640322_0013 failed 2 times due to AM Container for appattempt_1415193640322_0013_000002 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
***********************************

 

Can you please suggest a solution? Do you think I should compile the Spark code on my cluster,
or should I use the Spark provided with CDH 5.1?

Any help will be appreciated!

Master Collaborator

Hm, why not just use the Spark that is part of CDH? If you want 1.1, can you update to CDH 5.2? Are there more logs? This isn't the underlying error.

Contributor

More Logs:

Application application_1415193640322_0016 failed 2 times due to Error launching appattempt_1415193640322_0016_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: java.io.EOFException
at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:710)
at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:151)
at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:142)
at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:262)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:696)
... 10 more
 
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnException): java.io.EOFException
at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:710)
at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:151)
at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:142)
at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:262)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:696)
... 10 more
 
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.Client.call(Client.java:1362)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy69.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
... 5 more
. Failing the application.

 

 

 

When I go to the NodeManager logs:

 

Log Type: stderr

Log Length: 87

Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
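(For anyone hitting the same thing: one possible cause, though only a guess and not confirmed here, is that the YARN containers cannot find a Spark assembly built with YARN support. In Spark 1.x you could point to one explicitly; both paths below are hypothetical:)

# Hedged sketch: make a YARN-enabled Spark assembly visible to the cluster.
# SPARK_JAR is the Spark 1.x mechanism; the HDFS path is a hypothetical example.
export SPARK_JAR=hdfs:///user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0.jar
./bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 3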

Contributor

Looks like I have to try upgrading CDH to 5.2 and use the Spark that comes with it, which does support all Spark modes, i.e. 'yarn-cluster', 'yarn-client', etc.

Explorer

Hi, I'm just a newbie trying to run an example first to get to know how Spark works.

 

I followed the link here: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig...

I tried to run in YARN client mode and got this error:

 

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/examples/SparkPi
Caused by: java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.examples.SparkPi. Program will exit.

 

I'm running CDH 5.2p0.36

Please help, because I don't even fully understand the guide in the link above.

 

Thank you!

Master Collaborator

You should use the documentation for CDH 5.2, which you are using and which corresponds to Spark 1.1:

 

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_ap...

 

You are looking at docs for CDH 5.0.x, which corresponds to Spark 0.9. A lot has changed since then.
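For reference, Spark 1.1 launches the Pi example via spark-submit rather than the 0.9-era invocation. A minimal sketch; the examples-jar path assumes the default CDH parcel layout:

# Run SparkPi in yarn-client mode with CDH 5.2's bundled Spark 1.1.
spark-submit --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples*.jar 10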