Support Questions
Find answers, ask questions, and share your expertise

Spark - R (cran) integration

Hello, we have models written in R (CRAN R). How do we invoke these existing models from Spark? Will they run in a distributed/scalable manner, as MLlib does? There is also "SparkR": is this an R version provided by Spark, distinct from CRAN R? Does SparkR provide all the ML libraries that CRAN R provides? Some models, like "MaxEntropy", are not available in MLlib but are available in CRAN R; can we invoke such models from Spark? Thanks

Master Collaborator
Spark does not parallelize or affect CRAN packages in any way. You can't execute regular R code in a distributed way.

SparkR is not a distribution of R. It is an API for using Spark from R. SparkR provides no CRAN libraries at all. It is just an interface to write your own Spark programs in R.
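To illustrate that distinction, a minimal SparkR program is just Spark API calls expressed in R; plain R code does not become distributed by itself. A sketch, assuming the early RDD-style SparkR API (names such as sparkR.init and parallelize may differ across Spark versions):

```r
# Hypothetical minimal SparkR session (early RDD-style API).
library(SparkR)
sc <- sparkR.init(master = "local")        # connect R to a Spark cluster
rdd <- parallelize(sc, 1:100)              # distribute a plain R vector
doubled <- lapply(rdd, function(x) x * 2)  # a Spark transformation written in R
collect(doubled)                           # bring results back to the driver
sparkR.stop()
```

Every function here (parallelize, lapply on an RDD, collect) is Spark machinery surfaced in R, not a CRAN statistics package.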


Thanks Srowen.
As I understand it, "SparkR is just an interface to write your own Spark programs in R". Does this imply Spark does not have its own R packages/libraries but needs to fall back on other R implementations like CRAN R? In other words, we would write R programs in SparkR and call libraries from CRAN R. Our problem: for text classification, "maxEntropy" provides higher accuracy than "naiveBayes", and "maxEntropy" is not available in MLlib; the only option is to use a library from R ("RTextTools"). Any example of Spark invoking an R library/package would be useful.

Master Collaborator
Spark's package is an API to Spark. It does not use, nor is it used by, CRAN packages that provide, say, some kind of statistical function. You can't use the "maxEntropy" package you're referring to with SparkR.


The link below says "includePackage" is an option. Can this be used from SparkR to add the "RTextTools" CRAN package?


using existing R packages

SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster. For example, to use the Matrix package in a closure applied on each partition of an RDD, you could run:
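The pattern those docs describe can be sketched roughly as follows (assuming the AMPLab-era SparkR RDD API; includePackage and lapplyPartition are not part of the later DataFrame-based SparkR shipped with Spark 1.4+, so treat the names as version-specific):

```r
library(SparkR)
sc <- sparkR.init(master = "local")

includePackage(sc, Matrix)                 # load Matrix on every worker first

# Runs once per partition, on the worker, with Matrix already attached.
generateSparse <- function(part) {
  sparseMatrix(i = 1:2, j = 1:2, x = 1)
}

rdd <- parallelize(sc, 1:4, 2L)
sparseM <- lapplyPartition(rdd, generateSparse)
collect(sparseM)
```

Note this only loads the package on each worker; it does not make the package's own computations distributed.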



As you have mentioned, SparkR is just an interface. That means SparkR by itself has no packages and has to depend on other implementations like CRAN R, correct?


Another definition says:

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR allows easy use of existing R packages inside closures. Spark computations are automatically distributed across all the cores and machines available on the Spark cluster, so this package can be used to analyze terabytes of data using R.

Master Collaborator
Yes, you can call any package you like from a function. But the packages will not use Spark; they're executing on one node as usual. This is just fine if you want to do some local transformation with a package, but not if you're expecting it to parallelize a decision forest or something. You can call through to some of the implementations in Spark MLlib for that.
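Concretely, calling a CRAN function inside a SparkR closure distributes the scheduling, not the function itself: each invocation is ordinary single-node R on one worker. A sketch with the old RDD-style API (function names are assumptions; check your Spark version's docs):

```r
library(SparkR)
sc <- sparkR.init(master = "local")

# Spark splits the list across partitions, but mean() itself runs as
# plain, single-threaded R on whichever worker holds each element.
rdd <- parallelize(sc, list(c(1, 2, 3), c(4, 5, 6)))
means <- lapply(rdd, function(v) mean(v))
collect(means)
```

So a CRAN model-fitting routine called this way would train on one node's data only, which is why distributed training has to go through MLlib.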


Thanks - I will try to invoke the R package and keep you updated. If you have any links/examples for this, kindly share.

I attempted to access the CRAN RTextTools package from Spark by copying it from the CRAN R library folder to the spark/lib folder, and also tried "includePackage". I got:

Error: could not find function "includePackage"
Error: package ‘RTextTools’ was built for x86_64-w64-mingw32
In addition: Warning message:
package ‘RTextTools’ was built under R version 3.2.4
Execution halted

Another doubt I have: how are Spark MLlib functions accessed from SparkR? I guess there is no need to add libraries/packages and SparkR will resolve them automatically? But what will the call signature be from SparkR? Thanks
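On that last question: the MLlib algorithms that SparkR exposes need no extra packages; they are wrapped as ordinary R functions operating on a Spark DataFrame. A hedged sketch using the glm() wrapper introduced around SparkR 1.5 (initialization names such as sparkRSQL.init vary by Spark version, so check the SparkR docs for yours):

```r
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# MLlib's regression, surfaced through SparkR's own glm();
# no CRAN modelling package is involved.
df <- createDataFrame(sqlContext, iris)    # note: '.' in column names becomes '_'
model <- glm(Sepal_Length ~ Sepal_Width, data = df, family = "gaussian")
summary(model)
predict(model, df)
```

Separately, the "built for x86_64-w64-mingw32" error in your log indicates the copied RTextTools build was a Windows binary, so it cannot load on a Linux node regardless of where it is placed.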