Created on 06-05-2023 08:45 PM - edited on 06-08-2023 02:43 AM by VidyaSargur
The Cloudera Data Platform (CDP) is a hybrid data platform designed to deliver faster and easier data management, analytics and AI at enterprise scale. Cloudera Machine Learning (CML) is one of CDP’s Cloud Native Data Services designed to enable secure, governed Data Science.
With immediate access to enterprise data pipelines, scalable compute resources, and access to preferred tools, Data Scientists and Engineers use CML to streamline the process of getting analytic workloads into production and intelligently manage machine learning use cases and MLOps processes.
While CML Data Scientists spend most of their time prototyping and productionizing models in CML Projects, the CML Admin needs to be familiar with the CML Workspace, its basic architecture and how it allocates Cloud resources.
With this article we will share some foundational concepts in order to help CML Administrators better understand how to size Workspaces.
What is a Workspace?
CML workloads are executed within Workspaces and in turn within Projects and Teams. To the CML User, the Workspace is a high-level construct to create CML Projects, store CML Runtimes, and perform other administrative tasks such as creating Resource Profiles.
However, under the covers, the Workspace is better defined as an auto-scaling service for Data Science leveraging Kubernetes. The Kubernetes cluster runs in Cloud Virtual Machines using AKS and EKS in the Public Cloud or OCP and Cloudera ECS in the Private Cloud. The CML Administrator or Data Scientist is not required to know or handle Kubernetes in any way. CML automatically deploys and manages the infrastructure resources for you in your CDP Environment of choice.
When a Workspace is first created, a node is deployed to the underlying infrastructure. This node is a fixed resource that must run at all times and incurs a small cost.
Subsequently, when a CML User runs a workload such as a Notebook, a Model API Endpoint, or a Batch Job, the CML Workspace provisions the necessary Pod(s) thus requesting a second node from the underlying infrastructure.
As mentioned above, the auto-scaling process is fully automated and does not require any supervision. Auto-scaling events are fast and designed so that CML Users are not aware of them. Running workloads are not affected by the auto-scaling event e.g. running Sessions will continue undisturbed. If needed, any pending workloads such as new CML Sessions or previously scheduled CML Jobs will be queued automatically until new resources are deployed.
At a high level, the pods carve out resources from the node(s), and those resources are released when the workload completes. Thus, the CML Customer is charged only for the cloud resources consumed while workloads are running.
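The scale-out behavior described above can be sketched as a toy simulation. This is a minimal illustration rather than the actual CML or Kubernetes scheduler: the `Node` class, the `schedule` function, and the first-fit placement rule are all assumptions made for the example. It also ignores the capacity that system services reserve on each node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical worker node with fixed CPU and memory capacity."""
    cpu: float
    mem_gib: float
    used_cpu: float = 0.0
    used_mem_gib: float = 0.0

    def fits(self, cpu, mem_gib):
        return (self.used_cpu + cpu <= self.cpu
                and self.used_mem_gib + mem_gib <= self.mem_gib)

    def allocate(self, cpu, mem_gib):
        self.used_cpu += cpu
        self.used_mem_gib += mem_gib


def schedule(workloads, node_cpu, node_mem_gib, max_nodes):
    """Place (cpu, mem_gib) workloads first-fit, adding nodes up to max_nodes.

    Returns (nodes, queued), where `queued` holds the workloads that could
    not be placed because the upper bound of the range was reached.
    """
    nodes, queued = [], []
    for cpu, mem in workloads:
        target = next((n for n in nodes if n.fits(cpu, mem)), None)
        can_scale = (len(nodes) < max_nodes
                     and cpu <= node_cpu and mem <= node_mem_gib)
        if target is None and can_scale:
            target = Node(node_cpu, node_mem_gib)  # simulated scale-out event
            nodes.append(target)
        if target is None:
            queued.append((cpu, mem))  # waits until capacity is available
        else:
            target.allocate(cpu, mem)
    return nodes, queued


# Five sessions of 2 CPU / 8 GiB on 4 CPU / 16 GiB nodes, at most 2 nodes.
nodes, queued = schedule([(2, 8)] * 5, node_cpu=4, node_mem_gib=16, max_nodes=2)
print(len(nodes), len(queued))  # 2 1
```

In this toy run, four sessions fill two nodes and the fifth is queued because the maximum of the range has been reached.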
The CML User explicitly picks the amount of CPU, Memory and optionally GPU resources when launching the workload. This amount is called a Resource Profile (e.g. 1 CPU / 2 GiB Mem) and it is predefined by the CML Admin at the Workspace level in order to provide an approval process and prevent Data Scientists from consuming too many resources without control.
When deploying the Workspace for the first time, the user is prompted to select an instance type and an Autoscale Range (see image below). In the Public Cloud, these are AWS or Azure instances. The Autoscale Range is simply a min and max boundary of the instances that can be deployed by the Service.
Typically, the more CPU, Memory, and GPU resources available per instance, the higher the hourly cost to run them but the more CML workloads can be deployed per instance without requiring the autoscaler to deploy an additional node.
Because a typical workload such as a Data Exploration Notebook only requires a small Resource Profile, it is not uncommon to have multiple users working concurrently within the same node and thus at a fairly limited hourly cost. This means that instance types of relatively small size can be chosen when deploying a workspace. In the event of more horsepower being required, the Workspace will simply autoscale by adding as many instances as required and allowed by the Workspace Autoscale Range.
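As a rough illustration of how many concurrent workloads fit on one instance, the sketch below divides instance capacity by a Resource Profile. The function name and the `reserved_fraction` parameter are assumptions for this example. In practice some capacity on each node is consumed by system services, so treat the result as an upper bound.

```python
def sessions_per_instance(instance_cpu, instance_mem_gib,
                          profile_cpu, profile_mem_gib,
                          reserved_fraction=0.0):
    """Estimate how many workloads of one profile fit on a single instance.

    reserved_fraction is a hypothetical share of the instance set aside
    for system overhead (0.1 means 10% is unavailable to workloads).
    """
    if profile_cpu <= 0 or profile_mem_gib <= 0:
        raise ValueError("profile CPU and memory must be positive")
    usable_cpu = instance_cpu * (1 - reserved_fraction)
    usable_mem = instance_mem_gib * (1 - reserved_fraction)
    # The scarcer of the two resources determines the count.
    return int(min(usable_cpu // profile_cpu, usable_mem // profile_mem_gib))


# A 16 vCPU / 64 GiB instance with a 1 CPU / 2 GiB profile.
print(sessions_per_instance(16, 64, 1, 2))   # 16, CPU is the limiting resource
# The same instance with a 4 CPU / 32 GiB profile.
print(sessions_per_instance(16, 64, 4, 32))  # 2, memory is the limiting resource
```

Running the calculation for both your smallest and your largest Resource Profile shows quickly whether an instance type is too small for the heavier workloads.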
However, if you plan on running workloads that cannot horizontally scale in a distributed fashion with frameworks such as Spark, TensorFlow, etc., then it may make sense to choose a more powerful instance type. This could be the case in Time Series Machine Learning where algorithms cannot always be distributed.
Finally, it’s important to note that CML Instance Types and autoscale ranges can be changed even after a Workspace has been deployed.
Cost Management Considerations
Instance hourly rates are publicly available on the Cloudera Pricing Site. In addition, your Cloudera Account Team can provide additional recommendations to plan and size your Workspace according to your use cases.
CML is designed to allow the Administrator to closely monitor and limit usage in order to prevent runaway cloud charges. As mentioned above, Resource Profiles are whitelisted by the CML Admin in order to prevent CML Users from requesting resources without supervision. To be specific, the CML User will only be able to launch Jobs, Sessions, Applications, etc. with the CPU/Mem/GPU profiles designated in the Runtime menu as shown below.
Furthermore, CML Users are also users at the CDP Environment level. In other words, each Workspace can grant or deny access to a particular CDP User.
Finally, within each Workspace, the CML Admin can create Quotas to directly limit a User’s maximum amount of CPU, Memory, and GPU use across all workloads at any given time. Quotas operate within the limits set by the Workspace Autoscale Range, which itself serves as a second, Workspace-wide option for managing costs.
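The combined effect of approved Resource Profiles and per-user Quotas can be sketched as a simple admission check. This is an illustrative model only, not the CML API: the `Resources` class, the `can_launch` function, and all of the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """CPU cores, memory in GiB, and GPU count."""
    cpu: float
    mem_gib: float
    gpu: int = 0


def can_launch(requested, allowed_profiles, running, quota):
    """Return True if `requested` is an approved profile and the user's
    total usage, including the new workload, stays within `quota`."""
    if requested not in allowed_profiles:
        return False
    total_cpu = requested.cpu + sum(r.cpu for r in running)
    total_mem = requested.mem_gib + sum(r.mem_gib for r in running)
    total_gpu = requested.gpu + sum(r.gpu for r in running)
    return (total_cpu <= quota.cpu
            and total_mem <= quota.mem_gib
            and total_gpu <= quota.gpu)


# Hypothetical approved profiles and per-user quota.
small = Resources(cpu=1, mem_gib=2)
large = Resources(cpu=4, mem_gib=16)
allowed = {small, large}
quota = Resources(cpu=8, mem_gib=32)

print(can_launch(large, allowed, [large], quota))         # True: 8 CPU / 32 GiB in total
print(can_launch(small, allowed, [large, large], quota))  # False: 9 CPU exceeds the quota
print(can_launch(Resources(16, 64), allowed, [], quota))  # False: not an approved profile
```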
Using Multiple Workspaces
It is common practice to create multiple CML Workspaces, as each additional Workspace provides workload isolation and a quick fallback in case of failure. CML Customers typically deploy them based on scope such as Use Case, Business Organization, or function, e.g. DEV vs QA vs PROD.
The additional workspace(s) can be created in the same CDP Environment or in a separate CDP Environment. In the former case, the Workspaces will share the same SDX Data Lake and thus their users will be able to access and transform the same datasets while being governed and secured by the same Atlas and Ranger services. In the latter case, creating Workspaces in different CDP Environments will guarantee that they won’t be adversely affected in case of a failure at the CDP Environment level.
For example, the below image shows two workspaces deployed in the same CDP Environment while a third one is in a separate one. Notice the first Workspace is undergoing a change in instance types and autoscale range.
Additionally, CML supports MLflow Registry, which allows you to deploy models from one Workspace to another. As a result, multiple workspaces can support DevOps pipelines across multiple CDP Environments and even allow you to deploy models from Public to Private Cloud and vice versa (Hybrid Machine Learning).
Although each Workspace comes with a small fixed hourly charge, another advantage is that you will be able to select different instance types and autoscale ranges for each deployment which in turn could allow you to save money by enforcing stricter limitations on particular business functions or user groups.
A Sizing Exercise Example
With all these considerations in mind, we recommend you go through a similar exercise as below when planning your Workspace deployment.
Step 1: Estimate the number of CML Users and optionally whether these will be working within the same or different Teams, Use Cases, and CDP Data Lakes.
Step 2: Estimate average and peak CPU, Memory, and optionally GPU consumption per User. If planning on more than one Team, determine whether the average and peak vary dramatically between them.
Step 3: Decide if you need more than one workspace. Try to group users into Teams and Use Cases as much as reasonably possible based on similarities in Data Lake Access, average and peak consumption. Other factors may include whether users need GPUs, special Ranger ACLs, and types of workloads (e.g. primarily hosting API Model Endpoints vs Exploratory Data Science in Notebooks vs Spark ETL in CML Jobs).
Step 4: Sum up all CPU, Memory, and GPU required per workspace at peak and average, then add 20%.
Step 5: Look up CPU, Memory, and GPU resources per AWS or Azure Instance types and estimate how many instances would be required to fit the sum from Step 4. Pick an Instance Type that will fit most of your average workloads with a reasonable instance count (i.e. within the 3-6 range) and your peak workloads with no more than 10 instances. If this is not possible, divide the workload further into two separate workspaces where one has the same or smaller instance types and the other has larger instance types.
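Steps 4 and 5 can be sketched as a small calculation. The code below is a minimal example under stated assumptions: the `Demand` class, the function names, the user counts, and the instance size are all hypothetical, and the instance figures are raw capacity with no allowance for system overhead. Substitute the actual specifications of the AWS or Azure instance types you are considering.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    """CPU cores, memory in GiB, and GPU count."""
    cpu: float
    mem_gib: float
    gpu: int = 0


def total_with_headroom(per_user, headroom=0.20):
    """Step 4: sum per-user demand, then add headroom (20% by default)."""
    factor = 1 + headroom
    return Demand(
        cpu=sum(d.cpu for d in per_user) * factor,
        mem_gib=sum(d.mem_gib for d in per_user) * factor,
        gpu=math.ceil(round(sum(d.gpu for d in per_user) * factor, 9)),
    )


def instances_needed(total, instance):
    """Step 5: instances required so that every resource fits."""
    # round() guards against floating-point noise before taking the ceiling.
    counts = [
        math.ceil(round(total.cpu / instance.cpu, 9)),
        math.ceil(round(total.mem_gib / instance.mem_gib, 9)),
    ]
    if total.gpu > 0:
        if instance.gpu == 0:
            raise ValueError("GPU demand requires a GPU instance type")
        counts.append(math.ceil(total.gpu / instance.gpu))
    return max(counts)


def evaluate(avg_total, peak_total, instance):
    """Apply the guideline: 3-6 instances on average, at most 10 at peak."""
    avg_n = instances_needed(avg_total, instance)
    peak_n = instances_needed(peak_total, instance)
    return avg_n, peak_n, (3 <= avg_n <= 6 and peak_n <= 10)


# Hypothetical workspace: 20 users, 2 CPU / 8 GiB each on average,
# 4 CPU / 16 GiB each at peak, on a 16 vCPU / 64 GiB instance type.
avg_total = total_with_headroom([Demand(2, 8)] * 20)
peak_total = total_with_headroom([Demand(4, 16)] * 20)
print(evaluate(avg_total, peak_total, Demand(16, 64)))  # (3, 6, True)
```

If the final value is False, repeat the calculation with a different instance type, or split the users into two workspaces as described in Step 5.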
In this article, we highlighted some of the most fundamental considerations for sizing a CML Workspace. In summary:
CML Workspaces are autoscaling Kubernetes clusters providing Workload Isolation. CML automatically deploys and manages the infrastructure resources for you and requires no knowledge or interaction with the Kubernetes resources under the hood.
When planning for the deployment of Workspaces it is important to keep in mind that multiple Workspaces can and should be deployed based on Use Case, Team, Function, and Resource Consumption estimates.
Generally, sizing a Workspace consists of estimating average and peak consumption in terms of CML Resource Profiles and mapping the estimates to AWS or Azure Instance Types. Additional considerations such as workload type, Data Lake access, and SLAs should also be weighed as decision factors.