
Spark on R vs. R on Spark (SparkR)?

What is the performance difference between Spark on R and R on Spark?

1 ACCEPTED SOLUTION

Super Guru

@Sandeep Nemuri

Without knowing the problem you are trying to solve, there is no fair comparison between two architectures that are meant to address different functionality. It depends on the type of problem you are trying to solve. Sorry to add this caveat instead of offering a silver bullet.

#1: If you have massive data stored and you need Spark's distributed power to fetch the data and generate the data frames that are then consumed by the R application, the following architecture is recommended.

R client -> R Server -> SparkR -> Spark -> Data Source (usually Hadoop)
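A minimal SparkR sketch of #1 could look like the following; it assumes Spark 2.x with SparkR on a YARN cluster, and the HDFS path, table, and column names are just placeholders. The point is that Spark does the distributed reading and aggregation, and only the small result is collected into a local R data frame for the R application to work on.

# Sketch of #1 (R on Spark); Spark 2.x with SparkR assumed, names are placeholders.
library(SparkR)
sparkR.session(master = "yarn", appName = "r-on-spark-sketch")

# Spark reads and aggregates the large data set in a distributed fashion.
events <- read.df("hdfs:///data/events", source = "parquet")
createOrReplaceTempView(events, "events")
daily <- sql("SELECT event_date, SUM(amount) AS total FROM events GROUP BY event_date")

# Only the small, aggregated result is pulled into a local R data.frame,
# where ordinary R code (plots, models, and so on) takes over.
local_daily <- collect(daily)
summary(local_daily)

sparkR.session.stop()

The key design point is that collect() is called only on the reduced result, never on the raw data.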

#2: If you are building Spark applications that eventually need access to some R functions delivered by the R server, your architecture looks more like this:

Spark client -> Spark on YARN -> SparkR -> R Server
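As a rough illustration of #2 (a sketch of the general idea using SparkR's spark.lapply(), not of an actual R Server integration), a Spark-driven job can ship a plain R function out to the executors. The bootstrap workload below is made up purely to show the shape of it.

# Sketch of #2: a Spark-driven job borrowing some R "brain" (Spark 2.x, SparkR).
library(SparkR)
sparkR.session(master = "yarn", appName = "spark-with-r-sketch")

# A plain R function; it runs inside the R worker process on each executor.
fit_one <- function(seed) {
  set.seed(seed)
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]   # built-in R data set
  fit  <- lm(mpg ~ wt + hp, data = boot)
  coef(fit)
}

# spark.lapply() distributes the calls across the cluster and returns a local list.
results <- spark.lapply(as.list(1:100), fit_one)
boot_coefs <- do.call(rbind, results)
apply(boot_coefs, 2, sd)   # bootstrap standard errors, computed locally

sparkR.session.stop()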

The first case would be what you call R on Spark. The second case would be what you call Spark on R.

My observation across multiple customers is that Spark applications use #2 and R applications use #1. A data scientist who uses R as his/her tool and only needs Spark to shuttle the massive data back and forth will use #1. A data scientist or an application developer who needs to deliver a Spark application and wants to leverage some existing R functionality will use #2.

Due to its distributed nature and its use of resources across multiple nodes, an architecture like #1 benefits an R application that needs Spark's muscle in the cluster, while #2 already has the Spark muscle and brain and also needs some of the R brain. Your R server has lots of brain but not a lot of muscle compared with Spark on YARN over a large cluster with lots of CPU and RAM.

If a response helped to shed some light, please don't forget to vote for or accept the best answer. If you have a better answer, please add it; a moderator will review it and eventually accept it.


4 REPLIES

Super Collaborator

Hi,

Not sure what you mean by Spark on R. If that refers to RHadoop, MS R Server, or DistributedR, those are not directly comparable with SparkR (they are not drop-in replacements), because the functionality differs. So before doing a performance comparison, you have to decide which functionality you want to compare.

Best regards, Mats

@Mats Johansson By Spark on R, I mean running Spark on R Server.

Which one is recommended: Spark on R or SparkR? I would also like to know the performance difference between the two.

Super Guru

@Sandeep Nemuri

Could you answer @Mats Johansson? I am interested in what your question means. The thread seems abandoned, and the community needs to understand both the question and the answer.
