SparkR in CDH 5.5?

New Contributor

Do I understand correctly that CDH 5.5 doesn't include SparkR?

Can anybody point me to a guide on how to install SparkR on Cloudera, if there is anything other than the official SparkR guide?

 

Thank you


Re: SparkR in CDH 5.5?

Hi,

 

It seems that Spark in YARN mode and SparkR are not compatible:

 

http://qnalist.com/questions/4848722/sparkr-is-it-possible-to-run-sparkr-on-yarn

 

However, the GitHub repository has installation instructions for SparkR on YARN:

 

https://github.com/amplab-extras/SparkR-pkg

 

The big problem is that Cloudera seems to have cut the SparkR libraries and binaries from its Spark distribution, so I am not sure how to incorporate them. A separate standalone install of Spark that includes SparkR could work, but I would hesitate to interfere with the Cloudera files. I'll look into it.

 

 

EDIT:

 

No luck: it seems that the independent package requires Spark version <= 1.2, so we would need to switch to an independent Spark standalone installation.

Re: SparkR in CDH 5.5?

Master Collaborator
Yes, there are a couple of issues here. SparkR is still pretty "alpha"
and does not work with other resource managers, hence it is far from
supported. Normally it would still be shipped, but it also doesn't work
without R installed on all the cluster machines. Unlike Python, R is
generally not already available. And R in turn can't be shipped
directly with CDH, since it's GPL-licensed. So some of the bits, like
the sparkR script, were left out to avoid confusion. It would just
error out every time.

However, it should be no problem to build the bits yourself if you want
and just run a separate standalone cluster on the same machines, once
you have all the R dependencies installed.

Re: SparkR in CDH 5.5?

Contributor

Any movement on this? Is it correct that we're still not going to get SparkR even in the upcoming CDH 5.6? BTW, installing R on all nodes would not be a problem, as we can do it easily with Puppet.

Re: SparkR in CDH 5.5?

Master Collaborator
CDH 5.6 is already released and does not support SparkR. I can't
promise that it *won't* be in CDH 5.7, but I don't believe it will be.

It's actually in the source tree
(https://github.com/cloudera/spark/tree/cdh5-1.5.0_5.6.0/R) and any
similarly-versioned Spark distribution could probably be built such
that its SparkR works with CDH. You'd have to build it manually. The
big issue is that, yes, you'd also have to manually install R and all
the supporting packages on your cluster.

Re: SparkR in CDH 5.5?

New Contributor

There is actually a rather simple workaround, provided that:

a) R is installed on all worker nodes

b) you access the cluster from a gateway node on which you have a user account

 

I have provided the details and a simple demonstration in a blog post here:

 

http://www.nodalpoint.com/sparkr-in-cloudera-hadoop/
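
Precondition (a) above can be checked on each node before trying SparkR. A minimal sketch in Python, assuming the R script runner is called Rscript and is expected on the PATH; the function name find_rscript is hypothetical, and this is not taken from the linked blog post:

```python
import shutil


def find_rscript(executable="Rscript"):
    """Return the absolute path of the R script runner, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_rscript()
    if path is None:
        # Without Rscript on this node, SparkR could not start an R worker process here.
        print("Rscript not found on PATH")
    else:
        print("Rscript found at " + path)
```

Running this on every worker node (for example through whatever configuration management tool is already in place) shows which machines still need R installed.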

Re: SparkR in CDH 5.5?

Contributor

ctsats,

 

On CentOS/RHEL, is there a correct way to install R on the worker nodes?  Would you recommend

yum install R

or

yum install R-core

?

Re: SparkR in CDH 5.5?

New Contributor

Why do you say that SparkR "doesn't work without R installed on all the cluster machines"?
I thought that SparkR was only a client-side R library that provides an R interface, with the actual computation on worker nodes performed by invoking Scala code (unless you use UDFs, but I think those are not actually supported in SparkR).

 

I have a Kerberized CDH 5.7.0 cluster, and I have built Spark version 1.6.0-cdh5.7.0 in a Docker container running on a gateway node. I use SparkR through Jupyter with the R kernel.
R isn't installed on the cluster nodes.
From a few tests, it seems that I can submit Spark jobs on YARN from the Jupyter R kernel without problems.

Re: SparkR in CDH 5.5?

I have a standalone Spark 1.6.1 installation working in my cluster alongside CDH 5.5. Obviously I do not launch this alternative Spark over YARN, but I do access HDFS and other resources.

 

I can confirm that SparkR works with R installed only on the master node...

 

 

Re: SparkR in CDH 5.5?

Master Collaborator

SparkR also runs R worker processes. Not everything can be pushed down to the JVM -- R lambdas, for example. There's really not much documentation about this, except maybe the config settings for it: http://spark.apache.org/docs/latest/configuration.html#sparkr

 

I'm not sure how you're running it, but maybe you are running entirely locally in your container, or simply haven't done anything that would require the SparkR worker process?
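
The SparkR settings on that configuration page can be passed to spark-submit as --conf options. A minimal sketch in Python that only assembles the command line, assuming the property names spark.r.command, spark.r.driver.command and spark.r.numRBackendThreads as listed there; the helper name, the script name analysis.R and the Rscript path are hypothetical:

```python
def sparkr_submit_command(script, master="yarn", rscript="Rscript", backend_threads=2):
    """Build a spark-submit argument list that sets the SparkR properties explicitly."""
    conf = {
        # Executable used to run R scripts for the driver and the workers in cluster modes.
        "spark.r.command": rscript,
        # Executable used to run R scripts for the driver in client modes.
        "spark.r.driver.command": rscript,
        # Number of threads the R backend uses to handle calls from the SparkR package.
        "spark.r.numRBackendThreads": str(backend_threads),
    }
    cmd = ["spark-submit", "--master", master]
    for key, value in conf.items():
        cmd += ["--conf", key + "=" + value]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical script name and Rscript location, for illustration only.
    print(" ".join(sparkr_submit_command("analysis.R", rscript="/usr/bin/Rscript")))
```

If the executable named in spark.r.command is missing on a worker node, any job that needs an R worker process there would be expected to fail, which is one way to tell whether a given workload stays entirely in the JVM.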
