Cloudera Employee

According to a survey conducted by Kaggle in 2021, Python is still the most commonly used programming language for data science, with over 80% of respondents choosing it as their preferred language. However, R continues to be popular among data scientists, with over 15% of respondents choosing it as their primary language.


One of the reasons for R's continued popularity is its strong statistical analysis capabilities. R was designed specifically for statistical computing and provides a rich ecosystem of packages for data analysis and visualization. This makes R a powerful tool for data scientists who need to analyze large datasets and perform complex statistical modeling.


In this article, we'll delve into how to deploy R models in Cloudera Machine Learning (CML), highlighting the steps and key considerations for building and deploying models in this environment.


CML's Model Framework


As a refresher, let's revisit the key concepts of a model in CML. CML's framework allows for maximum flexibility when it comes to deploying models. Here is the reference diagram showing the fundamental concepts of a model. 


Models - Concepts and Terminology


The model artifacts are called from within a Python or R script file. Regardless of the runtime used, you need to embed your prediction logic within a function. The input arguments sent to the CML model are in JSON format. By the time these parameters reach the function within the R script file, they have become an R list object. This is important to note because it determines what transformations, if any, need to occur before the prediction step in your code.
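To make the list handling concrete, here is a minimal sketch in plain R. The payload and the element names `a` and `b` are hypothetical, and the list is built by hand to stand in for what the function would receive once the JSON input has been parsed.

```r
# Hypothetical JSON request body sent to the model: {"a": 1, "b": 2}
# Assumption: the function receives the equivalent named R list.
args <- list(a = 1, b = 2)

# Inspect the structure: a list with two named numeric elements.
str(args)

# Elements can be read by name with $ or [[ ]].
first_value  <- args$a
second_value <- args[["b"]]

# If a model expects a data frame, the list converts to a one-row data frame.
args_df <- as.data.frame(args)
```

Checking the structure with `str()` during development is a quick way to decide whether the list can go straight into the prediction step or needs converting first.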


Simple Add Model in R

Let’s start by looking at a deployed CML model that adds two numbers. In this case, we take the two elements from the function arguments and add them.

R wrapper script

[Image: R wrapper script for the add-numbers model]

The CML model parameters, or in this case the named list elements, are defined when the CML model is deployed.
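For readers without access to the screenshot, a wrapper of this kind might look like the sketch below. The function name `add_numbers` and the element names `a` and `b` are assumptions for illustration, not necessarily the names used in the original deployment.

```r
# Sketch of a wrapper function for the add "model".
# `args` is the named list built from the JSON input.
add_numbers <- function(args) {
  # Pull the two values out of the list by name and add them.
  result <- args$a + args$b
  # Return a named list (an assumption, so the response has a named field).
  list(result = result)
}

# Local check with a hand-built list standing in for the parsed JSON input.
response <- add_numbers(list(a = 1, b = 2))
```

With this sketch, an example input of `{"a": 1, "b": 2}` at deployment time would correspond to the list used in the local check.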

Deploying the add 'model'


Working with actual prediction models

The example above helps us get started with using an R model in CML. Now let’s look at two model examples with a focus on the R script file and how parameters are ultimately passed to the model object.

For the two models we deploy below, we’ll be using the Cars93 dataset.


Simple Linear Regression

In the example below, we use Cylinders and Weight as features (independent variables) to predict our dependent variable, MPG.city. For details on how the model was built, see the R-CML repository on GitHub.


[Image: R wrapper script for the linear regression model]

In this example, you will note that no further transformation is required. The input parameters were passed directly into the prediction step.
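As a rough illustration of passing the list straight through, the sketch below fits a comparable model on Cars93 (from the MASS package) and predicts from a named list. The names `lm_model` and `predict_mpg` and the input values are hypothetical. Fitting the model inside the script is also an assumption made to keep the example self-contained; a deployed script would more likely load a saved model object.

```r
library(MASS)  # provides the Cars93 dataset

# Assumption: the model is fit here for illustration only; in practice you
# would typically load a previously trained model (e.g. with readRDS()).
lm_model <- lm(MPG.city ~ Cylinders + Weight, data = Cars93)

predict_mpg <- function(args) {
  # The named list is passed straight to predict() as the new data.
  prediction <- predict(lm_model, newdata = args)
  list(prediction = unname(prediction))
}

# Cylinders is a factor in Cars93, so it is supplied as a string here.
response <- predict_mpg(list(Cylinders = "4", Weight = 2700))
```

Note that a factor feature such as Cylinders needs to arrive as a string matching one of the training levels; a bare number would not be treated as a factor level.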



Decision Tree Model 

In our final model, we get slightly more sophisticated: we include more features and use a decision tree model.

[Image: R wrapper script for the decision tree model]

We trained our model so that it takes R data frame objects as inputs for predictions. Therefore we need the appropriate step to transform our list into a data frame.  
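The list-to-data-frame step can be sketched as follows, using the rpart package as one possible decision tree implementation. The feature set, the names `tree_model` and `predict_mpg_tree`, and the input values are all assumptions for illustration; the article does not list the original model's exact features.

```r
library(MASS)   # provides the Cars93 dataset
library(rpart)  # decision tree implementation assumed for this sketch

# Assumption: an illustrative feature set, fit in-script to stay self-contained.
tree_model <- rpart(MPG.city ~ Cylinders + Weight + Horsepower + EngineSize,
                    data = Cars93)

predict_mpg_tree <- function(args) {
  # Convert the incoming named list into a one-row data frame.
  new_data <- as.data.frame(args, stringsAsFactors = FALSE)
  # Rebuild the factor with the training levels so it matches the model.
  new_data$Cylinders <- factor(new_data$Cylinders,
                               levels = levels(Cars93$Cylinders))
  prediction <- predict(tree_model, newdata = new_data)
  list(prediction = unname(prediction))
}

# Hypothetical JSON input:
# {"Cylinders": "4", "Weight": 2700, "Horsepower": 110, "EngineSize": 2.2}
response <- predict_mpg_tree(list(Cylinders = "4", Weight = 2700,
                                  Horsepower = 110, EngineSize = 2.2))
```

Setting the factor levels explicitly is a defensive choice: it keeps a single incoming row consistent with the levels the model saw during training.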


Below we can see how we define the JSON input format for the model.



I hope this has given you enough information to go and build your own R models in CML! Happy model building!