Support Questions

Model training outside of edge node?

Master Guru

Can model training happen on the cluster instead of the edge node? Everything I have seen and experienced has been a data scientist running their model on the edge node and claiming it cannot be done on the cluster. Why?

9 REPLIES

I assume you are referring to using Spark's MLlib to train a machine learning model. If so, then I'm betting people are saying that because you have to launch Spark where the client is installed, which is typically on an edge node. The other reason is that if they are using Zeppelin to access Spark, the Zeppelin service and web client would likely be on the management node. However, when you run Spark in one of the YARN modes ("yarn-client" or "yarn-cluster"), the Spark job takes advantage of all the YARN nodes on the cluster. Tuning Spark properly to take advantage of these cluster resources can take some time, and many Spark jobs are not properly tuned. Hope that helps, and that I've understood the question.
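For example, a job submitted from the edge node along these lines (the class name, jar path, and resource settings are just placeholders, not from your environment) still runs its executors out on the YARN NodeManagers:

# Submitted from the edge node, but the executors run on the YARN nodes
spark-submit \
  --master yarn-cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.TrainModel \
  /home/spark/jobs/train-model.jar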

@Sunile Manjee can you please elaborate on which component is being used for model training: R, Python, Spark, Mahout? I can confirm @Paul Hargis's point about the Spark client typically being run on the edge node, but the actual model training would happen in distributed fashion across your HDP cluster.

Master Guru

@azeltov @Paul Hargis R is used for model training. Can this be done in distributed fashion across the HDP cluster? Right now I am seeing a few data scientists run strictly on the edge node and not utilizing their Spark data nodes. What am I missing? Is it too difficult to get all the R libraries onto each Spark data node? If the R libraries are pushed to the data nodes, will the model training run in distributed mode? Will it run with parallel execution?

In order to run properly on the cluster (using one of the two modes described above), Spark needs to distribute any extra jars that are required at runtime. Normally, the Spark driver sends required jars to the nodes for use by the executors, but that doesn't happen by default for user-supplied or third-party jars (pulled in via import statements). Therefore, you have to set one or two parameters, depending on whether the driver and/or the executors need those libs:

# Extra Classpath jars
spark.driver.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar
spark.executor.extraClassPath=/home/zeppelin/notebook/jars/guava-11.0.2.jar

If you are not sure, set both. Finally, the actual jar files need to be copied to the specified location. If it is on the local filesystem, you will have to copy them to each node's local filesystem; if you reference them from HDFS, a single copy will suffice.
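Another option is to let spark-submit ship the jars for you, or pick them up from a single copy in HDFS. The commands below are just a sketch with example paths, but --jars adds the listed jars to both the driver and executor classpaths:

# Ship a local jar with the application; Spark copies it to the executors
spark-submit --master yarn-client --jars /home/zeppelin/notebook/jars/guava-11.0.2.jar ...

# Or reference one copy already uploaded to HDFS
spark-submit --master yarn-client --jars hdfs:///user/zeppelin/jars/guava-11.0.2.jar ...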

Guru

Isn't the issue that Spark's R support simply honors the syntax of R and still relies on Spark MLlib for any distributed processing? I don't believe the R libraries were designed to run in distributed mode. I think as Scala gains more prominence with the data science community, Spark MLlib acquires more algorithms, and Zeppelin acquires more useful visualizations, much more data science will be done directly on the cluster. The advantages are self-explanatory. I know of at least two large companies with data science outfits that no longer sample data; they just go after what is on HDFS and leverage the compute of their YARN queue. Of course, they have Scala/Spark expertise in house.
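To make that concrete, here is a minimal Scala/MLlib sketch. It assumes the sc SparkContext that spark-shell or Zeppelin provides, and the HDFS path and label/feature layout are placeholders; the point is only that the training itself runs on the executors rather than on the edge node:

// Minimal sketch: train a logistic regression model with Spark MLlib
// against data in HDFS, so the work is distributed across the YARN executors.
// Path and column layout are placeholders.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/training.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))  // first column = label
}.cache()

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)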

Master Guru

@Vadim Are you suggesting R cannot run in distributed mode?

Guru

Not generally. You can use a framework like rmr or foreach to simulate parallelism, but there is a bunch of extra work required. Most data scientists just run R on their workstation and sample data to fit within the resource constraints of their system. Once Spark catches up in terms of available algorithms, I think more and more data science will be done on the cluster. Check out this blog about the different ways of running R with Hadoop: http://blog.revolutionanalytics.com/2015/06/using-hadoop-with-r-it-depends.html

Master Guru
@Vadim

Thanks for sharing. This article is about R on Hadoop; I am interested in R on Spark.

Guru

You asked why data science teams claim that they cannot do most of their work on the cluster with R. My point is that this is because R is mainly a client-side studio, not that different from Eclipse (but with more tools). The article I suggested points out the hoops you have to jump through to run R across a cluster.

Spark R does not really address this just yet, since Spark R is simply an R interpreter that turns the instruction sets into RDDs and then executes them as Spark on the cluster. Spark R does not actually use any of the R packages to execute logic. Take a look at the Spark R page (https://spark.apache.org/docs/1.6.0/sparkr.html); it mainly talks about creating data frames using R syntax. The section on machine learning covers Gaussian and binomial GLMs, and that's it; that is Spark R at this point.

If the requirements of your project can be satisfied with these techniques, then great, you can now do your work on the cluster. If not, you will need to learn Spark and Scala. Until Spark has all of the functions and algorithms that R is capable of, Spark R will not completely solve the problem. That is why data scientists who do not have a strong dev background continue to sample data to make it fit on their workstation, so that they can keep using all of the packages that R provides.