<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Spark 1.1.0 on cdh5.1.3 does not work in yarn-cluster mode - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21398#M1388</link>
    <description>&lt;P&gt;Hello!!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a similar issue;&amp;nbsp;I have CDH 5 installed on my cluster (version&amp;nbsp;Hadoop 2.3.0-cdh5.1.3).&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I have installed and configured a prebuilt version of Spark 1.1.0 (Apache version), built for Hadoop 2.3, on my cluster.&lt;/P&gt;&lt;P&gt;When I run the Pi example in 'client' mode, it runs successfully, but it fails in 'yarn-cluster' mode. The Spark job is successfully submitted, but fails after some time saying:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;***********************************&lt;BR /&gt;$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2 --driver-memory 500m --executor-cores 2 lib/spark-examples*.jar 3&lt;/P&gt;&lt;P&gt;Logs:&lt;BR /&gt;14/11/05 20:47:47 INFO yarn.Client: Application report from ResourceManager:&lt;BR /&gt;application identifier: application_1415193640322_0013&lt;BR /&gt;appId: 13&lt;BR /&gt;clientToAMToken: null&lt;BR /&gt;appDiagnostics: Application application_1415193640322_0013 failed 2 times due to AM Container for appattempt_1415193640322_0013_000002 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:&lt;BR /&gt;org.apache.hadoop.util.Shell$ExitCodeException:&lt;BR /&gt;***********************************&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you please suggest a solution? Do you think I should compile the Spark code on my cluster,&lt;BR /&gt;or should I use the Spark provided with CDH 5.1?&lt;/P&gt;&lt;P&gt;Any help will be appreciated!&lt;/P&gt;</description>
    <pubDate>Tue, 11 Nov 2014 13:04:40 GMT</pubDate>
    <dc:creator>Rakesh Gupta</dc:creator>
    <dc:date>2014-11-11T13:04:40Z</dc:date>
    <item>
      <title>Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7736#M1374</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am a newbie to Apache Spark.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have installed CDH-5 using parcels (Beta 2 version) and also installed Spark.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As per the Spark installation documentation,&amp;nbsp; &lt;A target="_blank" href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_Spark_prerequisites.html#../CDH5-Installation-Guide/../CDH5-Installation-Guide/cdh5ig_Spark_configuring.html"&gt;http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_Spark_prerequisites.html#../CDH5-Installation-Guide/../CDH5-Installation-Guide/cdh5ig_Spark_configuring.html&lt;/A&gt;, it says:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;"&amp;nbsp; &lt;STRONG&gt;Note&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The current version of CDH 5 does not support running Spark on YARN.&lt;/LI&gt;&lt;LI&gt;The current version of Spark does work in a secure cluster."&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, if YARN in CDH-5 does not support Spark, how do we run Spark in CDH-5?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please let me know and also provide any documentation if available.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 08:55:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7736#M1374</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2022-09-16T08:55:54Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7738#M1375</link>
      <description>&lt;P&gt;At the moment, CDH5b2 deploys Spark in "standalone" mode:&amp;nbsp;&lt;A target="_blank" href="https://spark.apache.org/docs/0.9.0/spark-standalone.html"&gt;https://spark.apache.org/docs/0.9.0/spark-standalone.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This simply means Spark tries to manage resources itself, rather than participating in a cluster manager like YARN or Mesos. As an end user, it shouldn't make much difference to you at all. Just fire up the shell and go.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Once a few kinks are worked out, Spark's YARN integration will be used in the future, as I understand.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Mar 2014 13:51:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7738#M1375</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-03-24T13:51:50Z</dc:date>
    </item>
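For readers landing here later, attaching a shell to such a standalone deployment looked roughly like the sketch below. The host name and port are placeholders, not values from this thread; the real spark:// URL is shown on the master's web UI (port 8080 by default).

```shell
# Sketch: attaching a Spark 0.9-era shell to a standalone master.
# "master-host:7077" is a placeholder; read the real spark:// URL
# off the master's web UI.
MASTER_URL="spark://master-host:7077"

# Spark 0.9's shell reads the MASTER environment variable; the command
# is assembled as a string here rather than executed.
SHELL_CMD="MASTER=$MASTER_URL /usr/lib/spark/bin/spark-shell"
echo "$SHELL_CMD"
```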
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7804#M1376</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have installed Spark in standalone mode in a CDH-5 cluster. But when I start the Spark Master, I am getting the following error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/master/Master
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.master.Master
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.master.Master.  Program will exit.&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have provided the folder '/usr/lib/spark' on the classpath and also set the variable '&amp;nbsp;SPARK_LIBRARY_PATH=/usr/lib/spark' in the file&amp;nbsp;spark-env.sh.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am still facing this error.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I installed Spark using yum.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you please assist? Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 25 Mar 2014 11:24:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7804#M1376</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2014-03-25T11:24:44Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7806#M1377</link>
      <description>&lt;P&gt;Are you on CDH5 beta 2? It already includes Spark. I wonder if its setup of Spark is interfering with whatever you have installed separately, or vice versa. Can you simply use the built-in deployment? It would be easier.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Mar 2014 11:29:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7806#M1377</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-03-25T11:29:12Z</dc:date>
    </item>
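One hedged way to act on this advice: before starting anything, check whether a hand-installed tarball is shadowing the packaged install. The paths and service names below follow CDH 5 packaging conventions of the time and may differ on a given system.

```shell
# Sketch: spotting a second Spark install that may shadow the packaged one.
# /usr/lib/spark is where the CDH packages of this era put Spark; a tarball
# unpacked elsewhere (e.g. under /opt) is a candidate conflict.
PACKAGED_SPARK="/usr/lib/spark"
LIST_CMD="ls -d $PACKAGED_SPARK /opt/spark*"

# With only the packaged install present, the init scripts manage the
# daemons (service name shown is the packaged default of the time):
START_CMD="sudo service spark-master start"
echo "$LIST_CMD"
echo "$START_CMD"
```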
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7946#M1378</link>
      <description>I removed the version I installed and used the one available in CDH-5, and it worked. Thanks!</description>
      <pubDate>Thu, 27 Mar 2014 12:15:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/7946#M1378</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2014-03-27T12:15:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8304#M1379</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am unable to run even the sample verification job in Scala. The worker node status is showing as alive with cores&amp;nbsp;&lt;SPAN&gt;4 (0 Used) and memory&amp;nbsp;&lt;SPAN&gt;6.7 GB (0.0 B Used).&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;SPAN&gt;But I am repeatedly getting the below error. Could you please assist?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 14:22:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8304#M1379</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2014-04-02T14:22:34Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8320#M1380</link>
      <description>&lt;P&gt;I believe that means you've requested more memory for a task than any worker has available, but people more knowledgeable might be able to confirm or deny that.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 17:00:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8320#M1380</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-04-02T17:00:59Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8362#M1381</link>
      <description>&lt;P&gt;But I have not manually requested memory anywhere or set any parameter.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, is there a way to control this? Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2014 04:38:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8362#M1381</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2014-04-03T04:38:55Z</dc:date>
    </item>
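A sketch of the knobs involved, for anyone hitting the same warning: even with nothing set explicitly, the application requests a default per-executor memory, and each worker advertises a fixed offer. The values below are illustrative placeholders, not recommendations from this thread.

```shell
# Sketch: the two sides of the "Initial job has not accepted any
# resources" warning in standalone mode. Values are placeholders.

# What the application asks for per executor (Spark of this era defaulted
# to 512m when spark.executor.memory was unset):
APP_MEM_OPT="-Dspark.executor.memory=512m"

# What each worker offers, set in conf/spark-env.sh on the worker side:
WORKER_MEM_LINE="export SPARK_WORKER_MEMORY=4g"

# The request must fit inside the offer, or tasks are never scheduled.
echo "$APP_MEM_OPT"
echo "$WORKER_MEM_LINE"
```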
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8472#M1382</link>
      <description>&lt;P&gt;Apache Spark uses the Derby database in the background, and hence only one instance of the 'Spark-Shell' can be connected at any time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any way to configure MySQL or any other RDBMS, and is there any configuration document?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Apr 2014 07:01:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/8472#M1382</guid>
      <dc:creator>ArunShell</dc:creator>
      <dc:date>2014-04-07T07:01:27Z</dc:date>
    </item>
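No answer to this question appeared in the thread. For reference, the single-connection Derby behavior is a property of the Hive metastore used by Hive-integrated setups, and the usual way around it is to back the metastore with MySQL. The property names below are the standard Hive ones; the host, database, user, and password are placeholders, and in practice these pairs go into hive-site.xml rather than a shell script.

```shell
# Sketch: metastore properties for MySQL instead of embedded Derby,
# shown as the key=value pairs that would go into hive-site.xml.
# Every value below is a placeholder.
METASTORE_URL="jdbc:mysql://db-host:3306/metastore"
printf '%s\n' \
  "javax.jdo.option.ConnectionURL=$METASTORE_URL" \
  "javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver" \
  "javax.jdo.option.ConnectionUserName=hiveuser" \
  "javax.jdo.option.ConnectionPassword=placeholder"
```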
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14982#M1383</link>
      <description>&lt;P&gt;My apologies for reawakening this dead thread, but the subject line was still applicable:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to get Spark 1.0 to run on CDH5 YARN. That means NOT the provided 0.9 RPMs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Still very new to Spark, I'm trying to reconcile what seems like somewhat contradictory information between Apache's Spark site and the CDH5 installation guide. This begins with precisely *what* is installed on *which* node in the cluster. Apache simply stated "on ALL nodes", without any distinction between node role - namenode? datanode? resource manager? node manager? application master? Then there's the client, say a development machine or similar.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Cloudera docs state to only start the master on one node - yet, which machine is best chosen for this? Coming from MapReduce this would imply the namenode. But it's not made clear that's the right choice. Should it instead be one of the datanodes (in this case they are also the node managers and application masters for YARN, in other words the worker bees of the cluster)? Or maybe it should be the client machine from which jobs are launched through shell or submit?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Apache cluster mode overview isn't much help either. I suppose that with Spark still being fresh and new, the doco has to catch up from a typical case of "this was totally obvious to the creators" for us mere mortals.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What makes this more confusing is that the 1.0 tarball from Apache is one large directory, as opposed to the RPMs which break the packages down. I assume I could simply install the tarball on each and every machine on the cluster, from client to namenode/resource manager to datanode/node manager and just push out configuration settings to all, but that seems somewhat "sloppy".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the end it would be great if it were clear enough to take the 1.0 source and create my own RPMs, aligned with the current 0.9 CDH5 packages and install them only where needed, but for that I need a better understanding of what goes where.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any suggestions and pointers welcome!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2014 12:02:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14982#M1383</guid>
      <dc:creator>Marakai</dc:creator>
      <dc:date>2014-07-11T12:02:54Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14984#M1384</link>
      <description>&lt;P&gt;You might wait for CDH 5.1.0, which will be released very soon. This deploys Spark 1.0.0+patches on YARN for you.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;"Node" means a machine on which you want to run Spark. "Namenode" is for example an HDFS concept. It is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not described in terms of, say, HDFS roles.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You do not need to start the Spark master on the HDFS namenode. You didn't have to start the MR jobtracker on the namenode either. On a cluster I ran, I put the master on the namenode just since it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The Spark master is not the same thing as a client. Its role is like that of the jobtracker really. It would not be run outside the cluster. You may be thinking of a driver for your specific app.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Apache distro is indeed a tarball and it's up to you to deploy it and run it. The role of CDH is to package, deploy and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you want to get adventurous, you can go to all machines and dig into /opt/cloudera/parcels/CDH/lib/spark and replace binaries with a newer compiled version. That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2014 12:28:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14984#M1384</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-07-11T12:28:58Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14990#M1385</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/133"&gt;@srowen&lt;/a&gt; wrote:&lt;BR /&gt;&lt;P&gt;You might wait for CDH 5.1.0, which will be released very soon. This deploys Spark 1.0.0+patches on YARN for you.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;"Node" means a machine on which you want to run Spark. "Namenode" is for example an HDFS concept. It is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not describe in terms of, say, HDFS roles.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You do not need to start the Spark master on the HDFS namenode. You didn't have to start the MR jobtracker on the namenode either. On a cluster I ran, I put the master on the namenode just since it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The Spark master is not the same thing as a client. Its role is like that of the jobtracker really. It would not be run outside the cluster. You may be thinking of a driver for your specific app.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Apache distro is indeed a tarball and it's up to you to deploy it and run it. The role of CDH is to package, deploy and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you want to get adventurous, you can go to all machines and dig into /opt/cloudera/parcels/CDH/lib/spark and replace binaries with a newer compiled version. 
That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Ah, thanks so much for that info! This answered a lot of my questions in one fell swoop.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am of course aware that namenode/datanodes and RM/NM are not synonymous, however for simplicity's sake I put them together, as they frequently are.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My assumption was that the master was equivalent to a jobtracker, as you said, and therefore would frequently be found on a NN/RM node, whereas the workers would go on a DN/NM/AM. Again, lumping those Hadoop components together.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If not, how would the workers access files on HDFS, unless by streaming? What would that do to performance? It's unfortunate that the Apache docs don't give a really detailed view of the architecture and component interaction both within Spark as well as with the various Hadoop components.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think it's this that caused my confusion: if we're talking about Spark operating in a YARN environment, then there's a tacit implication of also having a "typical" underlying infrastructure based on your usual Hadoop cluster. Even if we take YARN out of the equation, if your data is on HDFS, then where do the Spark workers need to sit to ensure the maximum access speed? Talk of RDDs is all well and good, but at some point your data is not all in memory, it's on platters whence it must get INTO memory! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When's 5.1 coming? &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2014 13:16:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14990#M1385</guid>
      <dc:creator>Marakai</dc:creator>
      <dc:date>2014-07-11T13:16:52Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14992#M1386</link>
      <description>&lt;P&gt;This might help a lot:&amp;nbsp;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/"&gt;http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes you want Spark executors to end up colocated with datanodes or else data has to be accessed over the network a lot. It works but of course ideally workers all process only local data. You should get that if YARN nodemanagers are colocated with datanodes, since YARN is the thing running Spark's executors in its containers, when using YARN.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Things get confusing because of at least two things. First, there are two different types of YARN deployment, although, I don't think they affect how you think about placing services. But second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, wherein you actually do separately control where Spark workers run, separately from YARN.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suppose I'd say the thing that matters is: datanodes and nodemanagers and spark worker services are present on all machines doing work.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2014 13:49:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/14992#M1386</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-07-11T13:49:49Z</dc:date>
    </item>
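The two YARN deployment types mentioned above differ mainly in where the driver runs. A sketch using the bundled Pi example, in Spark 1.x spark-submit syntax (the jar path is a placeholder; commands are assembled as strings, not executed):

```shell
# Sketch: the two YARN deployment modes, Spark 1.x syntax.
EXAMPLES_JAR="lib/spark-examples.jar"   # placeholder path

# yarn-client: the driver runs in the submitting process;
# only the executors run in YARN containers.
CLIENT_CMD="./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client $EXAMPLES_JAR 10"

# yarn-cluster: the driver itself also runs inside a YARN container
# (the ApplicationMaster).
CLUSTER_CMD="./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster $EXAMPLES_JAR 10"
echo "$CLIENT_CMD"
echo "$CLUSTER_CMD"
```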
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/15010#M1387</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/133"&gt;@srowen&lt;/a&gt; wrote:&lt;BR /&gt;&lt;P&gt;This might help a lot:&amp;nbsp;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/"&gt;http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;[...]&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Things get confusing because of at least two things. First, there are two different types of YARN deployment, although, I don't think they affect how you think about placing services. But second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, wherein you actually do separately control where Spark workers run, separately from YARN.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suppose I'd say the thing that matters is: datanodes and nodemanagers and spark worker services are present on all machines doing work.&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&amp;nbsp;Many thanks for the blog link! I need to spend a week just catching up on the Cloudera blog one of these days. You guys produce some fantastic articles in there.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, it was the mild confusion about 0.9 using this "standalone" mode instead of YARN which made me try and go for 1.0. Something I read on the Apache Spark pages, or some other link via Google, gave me the impression that the 0.9 RPMs are hard-wired for stand-alone mode and can't be configured for YARN instead. Could it have also been a comment/note box in the CDH5 install guide?
Anyway, I'm still digging through the config options to see whether 0.9 can be configured for YARN instead of standalone.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;May get around to it this weekend, unless I crash and fall asleep, dang Worldcup! &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2014 23:13:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/15010#M1387</guid>
      <dc:creator>Marakai</dc:creator>
      <dc:date>2014-07-11T23:13:11Z</dc:date>
    </item>
    <item>
      <title>Spark 1.1.0 on cdh5.1.3 does not work in yarn-cluster mode</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21398#M1388</link>
      <description>&lt;P&gt;Hello!!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a similar issue;&amp;nbsp;I have CDH 5 installed on my cluster (version&amp;nbsp;Hadoop 2.3.0-cdh5.1.3).&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I have installed and configured a prebuilt version of Spark 1.1.0 (Apache version), built for Hadoop 2.3, on my cluster.&lt;/P&gt;&lt;P&gt;When I run the Pi example in 'client' mode, it runs successfully, but it fails in 'yarn-cluster' mode. The Spark job is successfully submitted, but fails after some time saying:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;***********************************&lt;BR /&gt;$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2 --driver-memory 500m --executor-cores 2 lib/spark-examples*.jar 3&lt;/P&gt;&lt;P&gt;Logs:&lt;BR /&gt;14/11/05 20:47:47 INFO yarn.Client: Application report from ResourceManager:&lt;BR /&gt;application identifier: application_1415193640322_0013&lt;BR /&gt;appId: 13&lt;BR /&gt;clientToAMToken: null&lt;BR /&gt;appDiagnostics: Application application_1415193640322_0013 failed 2 times due to AM Container for appattempt_1415193640322_0013_000002 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:&lt;BR /&gt;org.apache.hadoop.util.Shell$ExitCodeException:&lt;BR /&gt;***********************************&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you please suggest a solution? Do you think I should compile the Spark code on my cluster,&lt;BR /&gt;or should I use the Spark provided with CDH 5.1?&lt;/P&gt;&lt;P&gt;Any help will be appreciated!&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2014 13:04:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21398#M1388</guid>
      <dc:creator>Rakesh Gupta</dc:creator>
      <dc:date>2014-11-11T13:04:40Z</dc:date>
    </item>
    <item>
      <title>Re: Spark 1.1.0 on cdh5.1.3 does not work in yarn-cluster mode</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21399#M1389</link>
      <description>&lt;P&gt;Hm, why not just use the Spark that is part of CDH? If you want 1.1, can you update to CDH 5.2? Are there more logs? This isn't the underlying error.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2014 13:22:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21399#M1389</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-11-11T13:22:12Z</dc:date>
    </item>
    <item>
      <title>Re: Spark 1.1.0 on cdh5.1.3 does not work in yarn-cluster mode</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21400#M1390</link>
      <description>&lt;P&gt;More Logs:&lt;/P&gt;&lt;DIV&gt;Application application_1415193640322_0016 failed 2 times due to Error launching appattempt_1415193640322_0016_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: java.io.EOFException&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:710)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)&lt;/DIV&gt;&lt;DIV&gt;at java.security.AccessController.doPrivileged(Native Method)&lt;/DIV&gt;&lt;DIV&gt;at javax.security.auth.Subject.doAs(Subject.java:415)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)&lt;/DIV&gt;&lt;DIV&gt;Caused by: java.io.EOFException&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readFully(DataInputStream.java:197)&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readUTF(DataInputStream.java:609)&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readUTF(DataInputStream.java:564)&lt;/DIV&gt;&lt;DIV&gt;at 
org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:151)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:142)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:262)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:696)&lt;/DIV&gt;&lt;DIV&gt;... 10 more&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)&lt;/DIV&gt;&lt;DIV&gt;at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)&lt;/DIV&gt;&lt;DIV&gt;at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)&lt;/DIV&gt;&lt;DIV&gt;at java.lang.reflect.Constructor.newInstance(Constructor.java:526)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)&lt;/DIV&gt;&lt;DIV&gt;at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)&lt;/DIV&gt;&lt;DIV&gt;at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)&lt;/DIV&gt;&lt;DIV&gt;at java.lang.Thread.run(Thread.java:744)&lt;/DIV&gt;&lt;DIV&gt;Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnException): 
java.io.EOFException&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:710)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)&lt;/DIV&gt;&lt;DIV&gt;at java.security.AccessController.doPrivileged(Native Method)&lt;/DIV&gt;&lt;DIV&gt;at javax.security.auth.Subject.doAs(Subject.java:415)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)&lt;/DIV&gt;&lt;DIV&gt;Caused by: java.io.EOFException&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readFully(DataInputStream.java:197)&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readUTF(DataInputStream.java:609)&lt;/DIV&gt;&lt;DIV&gt;at java.io.DataInputStream.readUTF(DataInputStream.java:564)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:151)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:142)&lt;/DIV&gt;&lt;DIV&gt;at 
org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:262)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:696)&lt;/DIV&gt;&lt;DIV&gt;... 10 more&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Client.call(Client.java:1409)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.Client.call(Client.java:1362)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)&lt;/DIV&gt;&lt;DIV&gt;at com.sun.proxy.$Proxy69.startContainers(Unknown Source)&lt;/DIV&gt;&lt;DIV&gt;at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)&lt;/DIV&gt;&lt;DIV&gt;... 5 more&lt;/DIV&gt;&lt;DIV&gt;. Failing the application.&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="text-decoration: underline;"&gt;&lt;STRONG&gt;When I go to node Manager logs:&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Log Type: stderr&lt;/P&gt;&lt;P&gt;Log Length: 87&lt;/P&gt;&lt;PRE&gt;Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher&lt;/PRE&gt;</description>
      <pubDate>Tue, 11 Nov 2014 14:09:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21400#M1390</guid>
      <dc:creator>Rakesh Gupta</dc:creator>
      <dc:date>2014-11-11T14:09:02Z</dc:date>
    </item>
    <item>
      <title>Re: Spark 1.1.0 on cdh5.1.3 does not work in yarn-cluster mode</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21581#M1391</link>
      <description>&lt;P&gt;Looks like I will have to try upgrading CDH to 5.2 and use the Spark that ships with it, which does support all Spark deploy modes, i.e. 'yarn-cluster',&amp;nbsp;&lt;SPAN&gt;'yarn-client', etc.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Nov 2014 11:43:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21581#M1391</guid>
      <dc:creator>Rakesh Gupta</dc:creator>
      <dc:date>2014-11-14T11:43:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21746#M1392</link>
      <description>&lt;P&gt;Hi, &lt;SPAN&gt;I'm just a newbie, trying to run an example first to get to know how Spark works.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I followed the&amp;nbsp;&lt;/SPAN&gt;link here: &lt;A target="_blank" href="http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_running_spark_apps.html?scroll=concept_w24_rsc_nn_unique_1&amp;nbsp;"&gt;http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_running_spark_apps.html?scroll=concept_w24_rsc_nn_unique_1&amp;nbsp;&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I'm trying to run in YARN client mode and&amp;nbsp;got this error:&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/examples/SparkPi&lt;BR /&gt;Caused by: java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi&lt;BR /&gt;at java.net.URLClassLoader$1.run(URLClassLoader.java:202)&lt;BR /&gt;at java.security.AccessController.doPrivileged(Native Method)&lt;BR /&gt;at java.net.URLClassLoader.findClass(URLClassLoader.java:190)&lt;BR /&gt;at java.lang.ClassLoader.loadClass(ClassLoader.java:306)&lt;BR /&gt;at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)&lt;BR /&gt;at java.lang.ClassLoader.loadClass(ClassLoader.java:247)&lt;BR /&gt;Could not find the main class: org.apache.spark.examples.SparkPi. Program will exit.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm running CDH 5.2p0.36&lt;/P&gt;&lt;P&gt;Please help, because I don't even fully understand the guide in the link above.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 18 Nov 2014 09:47:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21746#M1392</guid>
      <dc:creator>MabuXayda</dc:creator>
      <dc:date>2014-11-18T09:47:24Z</dc:date>
    </item>
    <item>
      <title>Re: Spark on YARN in CDH-5</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21748#M1393</link>
      <description>&lt;P&gt;You should use the documentation for CDH 5.2, the version you are running, which corresponds to Spark 1.1:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html"&gt;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You are looking at the docs for CDH 5.0.x, which corresponds to Spark 0.9. A lot has changed since then.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Nov 2014 09:57:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-on-YARN-in-CDH-5/m-p/21748#M1393</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2014-11-18T09:57:53Z</dc:date>
    </item>
  </channel>
</rss>

