
Under the Hood: How We Built Data-Driven Recommendations for Spark Resources


I. Introduction: The Over-Provisioning Paradox

Have you ever hit "Submit" on a Spark job and felt that nagging urge to double the executor memory just in case?

You’re not alone. As developers, our primary KPI is successful delivery. If a job fails at 3:00 AM because of an OutOfMemory error, that’s a direct hit to our productivity. To sleep better at night, we adopt a "Safety First" strategy: we request the maximum possible CPU and memory we think the job could ever need, or simply the maximum the cluster policy allows.

But here’s the reality: those "just in case" resources usually sit idle. They are reserved, billed, and locked away from other workloads, but never actually used. When you multiply this "resource hoarding" across thousands of jobs and multi-tenant clusters, you aren't just over budget; you're creating massive scheduling bottlenecks.

At Cloudera, we realized that developers don't over-provision because they want to be wasteful; they do it because they lack visibility. You can't optimize what you can't measure. That’s why we built the Resource Efficiency Analysis feature in Cloudera Observability. We wanted to close the gap between what you asked for and what your Spark executors actually consumed.

In this post, we’re going to dive into how we calculate Spark resource wastage metrics under the hood.

II. Part 1: How We Calculate the "True" Spark Footprint

To fix resource wastage, we first had to define it. In Cloudera Observability, we use a deceptively simple formula:

Resource Wastage = Requested Resources - Used Resources

To make this actionable for platform teams and "FinOps" enthusiasts, we standardize these metrics across all engines into two universal units: Core-Millis and GB-Millis. By attaching a dollar value to these units in your chargeback settings, we can turn idle CPU cycles directly into a "Potential Savings" figure, as the sketch below illustrates.
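
To make the unit conversion concrete, here is a minimal Python sketch of how wastage expressed in Core-Millis and GB-Millis could be turned into a dollar figure. The function name and the rates are illustrative assumptions, not Cloudera Observability's actual chargeback defaults.

    # Illustrative only: converts wastage units into a dollar estimate using
    # per-core-hour and per-GB-hour chargeback rates (hypothetical values).
    MS_PER_HOUR = 3_600_000

    def potential_savings_usd(core_millis_wasted, gb_millis_wasted,
                              usd_per_core_hour=0.04, usd_per_gb_hour=0.005):
        cpu_savings = (core_millis_wasted / MS_PER_HOUR) * usd_per_core_hour
        mem_savings = (gb_millis_wasted / MS_PER_HOUR) * usd_per_gb_hour
        return cpu_savings + mem_savings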

But how do we get these numbers accurately? Let’s look under the hood.

Since Spark is a distributed engine with long-lived executors, calculating "Allocation" is more than just looking at one number.

Calculating the "Allocated" Floor

We calculate the total resource "rent" by summing the durations of every executor and multiplying by their individual footprints.

  • CPU Allocation: Sum(Executor Durations) * Allocated Cores
  • Memory Allocation: This is where it gets tricky. If you only look at spark.executor.memory, you’re missing the bigger picture. We calculate the true footprint by summing: spark.executor.memory + spark.executor.memoryOverhead + spark.executor.pyspark.memory + spark.memory.offHeap.size (see the sketch after this list).
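
As a rough illustration of the two formulas above, here is a minimal Python sketch. The dictionary keys, and the assumption that memory configs are already normalized to GB, are ours for readability; they are not the exact shape of the data inside Cloudera Observability.

    # A minimal sketch of the "allocated floor": executor lifetime multiplied
    # by the per-executor CPU and memory footprint. Field names are hypothetical.
    def allocated_footprint(executors, conf):
        """executors: list of dicts with 'duration_ms'; conf: Spark configs,
        with memory values assumed to be normalized to GB."""
        total_duration_ms = sum(e["duration_ms"] for e in executors)

        # CPU allocation: Sum(Executor Durations) * Allocated Cores
        core_millis = total_duration_ms * conf["spark.executor.cores"]

        # Memory allocation: the true footprint is more than spark.executor.memory
        per_executor_gb = (
            conf["spark.executor.memory"]
            + conf.get("spark.executor.memoryOverhead", 0)
            + conf.get("spark.executor.pyspark.memory", 0)
            + conf.get("spark.memory.offHeap.size", 0)
        )
        gb_millis = total_duration_ms * per_executor_gb

        return core_millis, gb_millis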

Calculating the "Used" Reality

CPU Usage Calculation

For CPU usage, we aggregate task-level metrics like ExecutorRunTime, ExecutorDeserializeTime, and ResultSerializationTime across all tasks to calculate the actual utilization of the CPUs allocated to executors.
[Figure: CPU allocation vs. actual usage]
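
A simplified sketch of that aggregation is below. The task-metric keys mirror the names above, but the data layout and the one-core-per-task assumption are ours, not the production schema.

    # Sketch: sum the CPU-related task metrics (all in milliseconds) and compare
    # them against the allocated core-millis to estimate CPU wastage.
    # Assumes each task occupies a single core (spark.task.cpus = 1).
    def cpu_used_millis(tasks):
        return sum(
            t["executorRunTime"]
            + t["executorDeserializeTime"]
            + t["resultSerializationTime"]
            for t in tasks
        )

    def cpu_wastage_millis(tasks, executors, conf):
        allocated = sum(e["duration_ms"] for e in executors) * conf["spark.executor.cores"]
        return allocated - cpu_used_millis(tasks)  # core-millis reserved but never busy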

The Logic For Peak Memory Computation

We don't just look at the JVM heap. If you're using PySpark or off-heap memory, a simple heap analysis would give you a dangerously low recommendation.

We analyze the Resident Set Size (RSS) from the process tree. This captures the total physical memory footprint of the executor, including the JVM, Python sidecars, and the overhead. By basing our recommendation on the actual peak footprint of the entire process, we ensure the value we give you is grounded in physical reality.

 

[Figure: Memory allocation vs. actual usage]
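
Conceptually, the peak-memory logic boils down to something like the sketch below, where the RSS samples per executor process tree are assumed inputs; the actual collection mechanism inside the agent is more involved.

    # Sketch: the recommendation is grounded in the peak RSS of the whole
    # executor process tree (JVM, Python workers, native overhead).
    def peak_rss_gb(rss_samples_bytes):
        """rss_samples_bytes: periodic RSS samples (bytes) for one executor."""
        return max(rss_samples_bytes) / (1024 ** 3)

    def job_peak_rss_gb(per_executor_samples):
        """The peak across all executors drives the memory recommendation."""
        return max(peak_rss_gb(samples) for samples in per_executor_samples)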

III. Part 2: From Data to Action (Spark Value Recommendations)

Identifying "wastage" is a great first step, but let’s be honest: as an engineer, you don't just want to know you're wasting 40GB of RAM; you want to know the exact config changes that fix it so you can move on with your day.

We realized that for our recommendations to be useful, they had to be prescriptive. We shifted from "you're using too much" to "change this value to X." Here is how we built that recommendation engine.

1. The "Safety First" Algorithm

The biggest fear in down-sizing any job is the dreaded OutOfMemory (OOM) error. If our tool recommends a value that causes your job to crash, we’ve failed.

To prevent this, we built a Safety Factor into our logic. We don't just match your peak usage; we add a 20% buffer (a 1.2x multiplier). The sketch after the list below illustrates the rule.

  • For Underutilized Jobs (<60% memory used): We look at your peak memory usage across the run, apply the safety buffer, and suggest a lower value.
  • For Overutilized Jobs (>90% memory used): If you're redlining, we proactively suggest an increase (usually in 1GB increments) to prevent future instability, even if the job succeeded this time.
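
Here is an illustrative Python sketch of that sizing rule. The 60%/90% thresholds and the 1.2x factor come from the description above; the rounding and function shape are simplifications, not the exact production logic.

    import math

    SAFETY_FACTOR = 1.2  # 20% headroom over the observed peak

    def recommend_executor_memory_gb(allocated_gb, peak_used_gb):
        utilization = peak_used_gb / allocated_gb
        if utilization < 0.60:
            # Underutilized: size down to the peak plus the safety buffer.
            return math.ceil(peak_used_gb * SAFETY_FACTOR)
        if utilization > 0.90:
            # Overutilized: proactively nudge the allocation up in 1GB steps.
            return allocated_gb + 1
        return allocated_gb  # comfortable band: leave the config alone

    # For example, with 8GB allocated and a peak of roughly 3.3GB, the rule
    # suggests 4g -- the shape of the real-world example in Part IV below.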

2. Knowing When to Stay Quiet (Guardrails)

A good mentor knows when not to give advice. Our recommendation engine includes several "anti-patterns" where we intentionally withhold a recommendation (a simplified version of these checks is sketched after this list):

  • Data Skew: If one executor is at 90% while others are at 10%, decreasing resources would break the "straggler" executor. We detect the skew and tell you to fix the partition logic instead.
  • High GC Overhead: If your job is spending 30% of its time in Garbage Collection, reducing memory, even if usage looks low, is a recipe for disaster.
  • Shuffle Spills: If we see your job is spilling significant data to disk, we won't suggest a memory reduction.
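
A simplified version of these guardrails might look like the sketch below. The thresholds echo the examples above, but the exact signals and cut-offs used in production are more nuanced than this.

    # Sketch of "stay quiet" guardrails: return a reason to withhold the
    # recommendation, or None if it is safe to suggest a new value.
    def withhold_reason(executor_peaks_gb, gc_time_ms, run_time_ms, shuffle_spill_bytes):
        mean_peak = sum(executor_peaks_gb) / len(executor_peaks_gb)
        if max(executor_peaks_gb) > 3 * mean_peak:
            return "data skew: fix the partition logic instead of resizing executors"
        if gc_time_ms / run_time_ms > 0.30:
            return "high GC overhead: reducing memory is unsafe"
        if shuffle_spill_bytes > 0:
            return "shuffle spill detected: no memory reduction suggested"
        return None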

IV. Real-World Example

Imagine you have a Spark job configured with 8GB per executor.

  • Our Analysis: We see that across the entire run, the peak RSS memory never crossed 4GB.
  • The Recommendation: Cloudera Observability will explicitly suggest setting spark.executor.memory to 4g (a value that already includes the 20% safety buffer on top of the observed peak).

    [Screenshot: the recommendation as shown in Cloudera Observability]

You get a faster-scheduling job (due to lower resource demand), the cluster gets 4GB back (per executor) for other users, and the CFO stays happy. It’s a win-win-win.

V. What’s Next: Learning from the Past

Currently, our recommendations are based on the most recent run. But we know that data volumes fluctuate. Maybe your "Monday morning" run is 5x larger than your "Tuesday" run.

Our next evolution is Historical Trend Analysis. We are working on analyzing past runs to provide recommendations that are "seasonally aware," ensuring your settings are robust enough for month-end processing while remaining efficient for daily tasks.