There is great excitement about cloud computing, and the Covid-19 pandemic has compounded it. Auto-Scaling in the cloud, from adding resources when needed to terminating them when finished, provides great flexibility for on-demand computing resources. It also makes cloud computing financially viable. Even so, a recent survey showed organizations spending 23% over their cloud budget (n=750, Flexera, "State of the Cloud").
Cloudera Data Platform (CDP) lets you scale to the power you need and helps you avoid this over-budget spending with Auto-Scaling for Data Warehousing and Machine Learning using Kubernetes. We're going to dive into the details of how to set up an Auto-Scaling environment in Cloudera Machine Learning (CML).
CDP provides self-service provisioning of Workspaces. A Workspace can run Python, Scala, R, Spark, TensorFlow, etc. This example will show a Spark Auto-Scale environment. You can run Spark code here and not do any data science if you choose. Please don't let the name "Cloudera Machine Learning" stop you from using this Spark environment just because you're not doing any ML.
The following illustration shows the setup of an example Workspace; the next image zooms in on the details:
The Advanced Options define the resources to be pre-started and the limits on total auto-scaling. The configuration below starts up 5 AWS EC2-EKS instances for the users of this Workspace. It also starts up 5 GPU instances. As users run jobs that exceed these resources, additional nodes will be added to the Workspace. I've set a limit here of 30 nodes, but you can go much higher as needed. To keep costs to a minimum, you would set the initial number of servers lower and cap the total scale-up at what your budget allows. The tradeoffs of speed vs. budget vs. Service Level Agreements (SLAs) are important discussions to have with your Finance, IT, and User communities.
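To make the role of the two numbers concrete, here is a minimal sketch of an auto-scale range. The `AutoscaleRange` class and its field names are hypothetical and are not part of any CDP API; the sketch also assumes the 30-node limit applies to the CPU instance group. It only illustrates that the running node count stays between the pre-started minimum and the configured cap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscaleRange:
    """Hypothetical model of one instance group's auto-scale bounds."""

    min_nodes: int  # nodes pre-started with the Workspace
    max_nodes: int  # hard cap on scale-up

    def __post_init__(self):
        if not 0 <= self.min_nodes <= self.max_nodes:
            raise ValueError("need 0 <= min_nodes <= max_nodes")

    def clamp(self, wanted: int) -> int:
        """Node count the group would settle at for a given demand."""
        return max(self.min_nodes, min(wanted, self.max_nodes))


# Values from the example configuration: 5 pre-started nodes, limit of 30.
cpu_group = AutoscaleRange(min_nodes=5, max_nodes=30)

for wanted in (2, 12, 45):
    print(f"demand for {wanted:>2} nodes -> {cpu_group.clamp(wanted)} running")
```

Lowering `min_nodes` reduces the standing cost of an idle Workspace, while `max_nodes` bounds the worst-case bill.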
Once the Workspace is up and running, users can log in with their single sign-on and use the environment. In the interest of cost control (yes, we have a wall-of-shame for big cloud spenders and I don't want to be on it), I've set my initial instance count to 1 for the examples below.
We see here in the Cloudera Machine Learning dashboard that our initial allocation is 16 virtual CPUs (vCPUs) and 60 GiB of memory. Green indicates the resources assigned to our login; blue shows the total resources used across all users and deployed models in the Workspace.
Here is some PySpark code that requests resources. We want 12 executors for our code, and each executor should get 18 GiB of memory. The total memory required is 216 GiB (12 × 18 GiB), which is much greater than the 60 GiB currently available.
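As a rough illustration, a request like this could be expressed as follows. This is a minimal sketch and not the original code from the post: `spark.executor.instances` and `spark.executor.memory` are standard Spark properties, while the `spark_conf` dictionary, the `build_session` helper, and the application name are hypothetical. The helper imports PySpark lazily, so the arithmetic runs even where PySpark is not installed.

```python
# Resource request described in the text: 12 executors, 18 GiB each.
EXECUTOR_COUNT = 12
EXECUTOR_MEMORY_GIB = 18
AVAILABLE_GIB = 60  # initial Workspace allocation shown in the dashboard

# Standard Spark properties; Spark reads the "g" suffix as gibibytes.
spark_conf = {
    "spark.executor.instances": str(EXECUTOR_COUNT),
    "spark.executor.memory": f"{EXECUTOR_MEMORY_GIB}g",
}

requested_gib = EXECUTOR_COUNT * EXECUTOR_MEMORY_GIB
shortfall_gib = max(0, requested_gib - AVAILABLE_GIB)


def build_session(app_name="autoscale-demo"):
    """Start a SparkSession with the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(f"requested {requested_gib} GiB, available {AVAILABLE_GIB} GiB, "
      f"short by {shortfall_gib} GiB")
```

Calling `build_session()` inside a CML session would submit the request; the figures ignore any per-executor memory overhead, which would make the real shortfall somewhat larger.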
Our SparkSession will start, the code will start to run, but there are pods that cannot be scheduled.
The following is from the Cloudera Docs:
"...if the scheduler cannot find a node to schedule an engine pod because of insufficient CPU or memory, the engine pod will be in “pending” state. When the autoscaler notices this situation, it will change the desired capacity of the autoscaling group (CPU or GPU) to provision a new node in the cluster. As soon as the new node is ready, the scheduler will place the session or engine pod there..."
Once the PySpark code is launched, the autoscaler detects the pending pods and gets to work. This screenshot illustrates two EC2 nodes spinning up (in the Pending state) that will be allocated to Amazon's Elastic Kubernetes Service (EKS). In the foreground, the CML dashboard shows the net impact of two additional m5.4xlarge nodes. There are still pods that can't be scheduled, so we do another scale-up.
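The schedule-then-scale loop described in the docs can be sketched as a toy simulation. This is not Cloudera's or Kubernetes' actual implementation: it assumes a simple first-fit placement, considers memory only, adds one node per pass (the screenshots show nodes arriving in larger increments), and treats each node as offering 60 GiB. The function names `schedule` and `autoscale` are hypothetical.

```python
def schedule(pods_gib, nodes_free_gib):
    """First-fit placement; returns the pods that stay pending."""
    pending = []
    for pod in pods_gib:
        for i, free in enumerate(nodes_free_gib):
            if pod <= free:
                nodes_free_gib[i] = free - pod  # pod lands on node i
                break
        else:
            pending.append(pod)  # no node has room for this pod
    return pending


def autoscale(pods_gib, node_gib, start_nodes, max_nodes):
    """Add nodes while pods are pending and the cap allows it."""
    nodes_free_gib = [node_gib] * start_nodes
    pending = schedule(pods_gib, nodes_free_gib)
    while pending and len(nodes_free_gib) < max_nodes:
        nodes_free_gib.append(node_gib)  # raise desired capacity by one
        pending = schedule(pending, nodes_free_gib)
    return len(nodes_free_gib), pending


# 12 executor pods of 18 GiB, starting from a single 60 GiB node.
nodes, pending = autoscale([18] * 12, node_gib=60, start_nodes=1, max_nodes=30)
print(f"{nodes} nodes running, {len(pending)} pods still pending")
```

In this simplified model each node holds three executors, so the loop stops at four nodes; the real cluster needs more headroom because driver, session, and system pods also consume memory.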
To satisfy the PySpark request, an additional increment has been added to the Workspace. We now have more than 300 GiB of memory to work with.
We took two steps instead of one to scale up to the required resources. Scaling in two steps improves cost control: the PySpark code starts to run before any scale-up, so if the Spark session requests more than it really needs, doing the full scale-up immediately would be wasteful. I'm sure no one in your organization would ever ask for more CPU/memory than needed, but it does occasionally happen in our industry. The PySpark example above has some trivial code that doesn't need lots of resources. It would run to completion without any scale-up at all. I controlled the session to force the scale-up for this blog post.
Scaling down is a critical component of cost control. When the Spark session is over, the autoscaler returns to the resource level specified in the Workspace definition. The illustration below shows the instances shutting down and terminating:
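The scale-down decision can be sketched in the same spirit. This is an illustrative simplification rather than the autoscaler's real logic: it assumes a node is removable as soon as it hosts no pods, whereas real autoscalers typically also wait for a cool-down period. The `nodes_to_terminate` function and the node names are hypothetical.

```python
def nodes_to_terminate(pods_per_node, min_nodes):
    """Pick idle nodes to remove while keeping at least min_nodes running."""
    idle = [name for name, pods in pods_per_node.items() if pods == 0]
    surplus = max(0, len(pods_per_node) - min_nodes)
    return idle[:surplus]


# After the Spark session ends, only the first node still hosts a pod.
cluster = {"node-1": 1, "node-2": 0, "node-3": 0, "node-4": 0, "node-5": 0}

# The Workspace in this post was defined with an initial instance count of 1.
doomed = nodes_to_terminate(cluster, min_nodes=1)
print("terminating:", ", ".join(doomed))
```

With a higher `min_nodes`, fewer idle nodes would be removed, which trades standing cost for faster start-up of the next job.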