Do I understand correctly that CDH 5.5 doesn't include SparkR?
Can anybody point me to a guide on how to install SparkR on Cloudera, if there is anything other than the official SparkR guide?
It seems there is no compatibility between Spark in YARN mode and SparkR:
However, in the GitHub repository we can find installation instructions for SparkR on YARN.
The big problem is that Cloudera seems to have cut the SparkR libraries and binaries from their Spark distribution, so I am not sure how to incorporate them. A separate standalone install of Spark that includes SparkR could work, but I would hesitate to interfere with the Cloudera files; I'll look into it.
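If I get to it, the idea would be something like the following (just a sketch, assuming a stock Apache Spark 1.4+ build, which bundles SparkR, unpacked under a hypothetical /opt path so the Cloudera files stay untouched):

    # Sketch: use a plain Apache Spark download next to CDH; paths are hypothetical
    Sys.setenv(SPARK_HOME = "/opt/spark-1.6.0-bin-hadoop2.6")
    Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")   # reuse the CDH client configs
    # load the SparkR package that ships inside that Spark build
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    sc <- sparkR.init(master = "yarn-client", appName = "sparkr-test")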
No luck; it seems that the independent package requires a Spark version <= 1.2,
so we would need to switch to an independent standalone Spark installation.
Any movement on this? Is it correct that we're still not going to get SparkR even in the upcoming CDH 5.6? BTW, installing R on all nodes would not be a problem, as we can do it easily with Puppet.
There is actually a rather simple workaround, provided that
a) R is installed on all worker nodes
b) you access the cluster from a gateway node on which you have a user account
I have provided the details and a simple demonstration in a blog post here:
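In outline, it looks like this (a sketch; the Spark version and paths are placeholders from my setup, adjust them for yours):

    # Run from your own account on the gateway node
    Sys.setenv(SPARK_HOME = "/home/me/spark-1.6.1-bin-hadoop2.6",   # placeholder path
               HADOOP_CONF_DIR = "/etc/hadoop/conf")                # CDH client configs
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    # yarn-client mode: the driver runs on the gateway, the executors (and their
    # R worker processes, when needed) run on the cluster nodes, hence requirement (a)
    sc <- sparkR.init(master = "yarn-client")
    sqlContext <- sparkRSQL.init(sc)
    df <- createDataFrame(sqlContext, faithful)   # quick demo with a built-in dataset
    head(df)
    sparkR.stop()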
Why do you affirm that SparkR "doesn't work without R installed on all the cluster machines"?
I thought that SparkR was only a client-side R library that provides an R interface, while the actual computation on the worker nodes is performed by invoking Scala code (unless you use UDFs, but I think those are currently not supported in SparkR).
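For what it's worth, here is the kind of thing I mean (a sketch, assuming a sqlContext created with sparkRSQL.init); as I understand it, a DataFrame query like this is forwarded to the JVM over the SparkR backend, so no R process should be needed on the executors:

    # DataFrame operations are translated into calls on the JVM side;
    # the filter and aggregation below execute as Scala code on the executors
    df <- createDataFrame(sqlContext, faithful)
    long_eruptions <- filter(df, df$eruptions > 3)
    head(agg(groupBy(long_eruptions, "waiting"), count = n(df$eruptions)))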
I have a Kerberized CDH 5.7.0 cluster and I have built Spark version 1.6.0-cdh5.7.0 in a Docker container running on a gateway node. I use SparkR through Jupyter with the R kernel.
R isn't installed on the cluster nodes.
From a few tests, it seems that I can submit Spark jobs on YARN from the Jupyter R kernel without problems.
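Roughly, the first notebook cell does something like this (a sketch; the keytab, principal, and paths are placeholders for my environment):

    # authenticate against Kerberos before creating the context (placeholder keytab/principal)
    system("kinit -kt /home/me/me.keytab me@EXAMPLE.COM")
    Sys.setenv(SPARK_HOME = "/opt/spark-1.6.0-cdh5.7.0",   # my custom build, placeholder path
               HADOOP_CONF_DIR = "/etc/hadoop/conf",
               YARN_CONF_DIR = "/etc/hadoop/conf")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    sc <- sparkR.init(master = "yarn-client", appName = "jupyter-sparkr")
    sqlContext <- sparkRSQL.init(sc)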
I have a Spark 1.6.1 standalone installation coexisting with CDH 5.5 in my cluster. Obviously I do not launch this alternative Spark over YARN, but I can access HDFS and other resources.
I can confirm that SparkR works with R installed only on the master node...
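Something along these lines (a sketch; the master URL and HDFS path are placeholders from my environment):

    # standalone master instead of YARN; HDFS still works because
    # HADOOP_CONF_DIR points at the CDH client configuration
    Sys.setenv(SPARK_HOME = "/opt/spark-1.6.1-bin-hadoop2.6",
               HADOOP_CONF_DIR = "/etc/hadoop/conf")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    sc <- sparkR.init(master = "spark://master-host:7077")   # placeholder host
    sqlContext <- sparkRSQL.init(sc)
    df <- read.df(sqlContext, "hdfs:///user/me/data.json", source = "json")   # placeholder path
    head(df)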
It also runs R worker processes. Not everything can be pushed down to the JVM; R lambdas, for example, cannot. There's really not much documentation about this, except maybe the config settings for it: http://spark.apache.org/docs/latest/configuration.html#sparkr
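For example (a sketch; whether you need to set these depends on your build, and if I remember right spark.r.command only exists from Spark 1.5.3 on):

    # point the executors at the Rscript binary the nodes actually have;
    # this only matters once a job ships R code to the workers
    sc <- sparkR.init(master = "yarn-client",
                      sparkEnvir = list(
                        spark.r.command = "/usr/bin/Rscript",   # R used on the workers
                        spark.r.numRBackendThreads = "2"))      # RBackend threads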
I'm not sure how you're running it, but maybe you are running entirely locally in your container, or simply haven't done anything that would require the SparkR worker process?