Ozone White-Box Testing with Byteman: A Deep Dive into Fault Injection

Contributors: Soumitra, Swami, Rishabh, Uma, Arpit

Background

White-box testing is essential for distributed file systems like Apache Ozone, ensuring data consistency, replication, and fault tolerance by exposing race conditions, concurrency issues, and failure scenarios. Unlike black-box testing, which focuses on system behavior without internal insights, white-box testing provides a deeper understanding of internal mechanisms, helping detect and resolve complex synchronization issues that are otherwise hard to reproduce.

Why White-Box Testing for Ozone?

Most downstream Ozone integration tests rely on black-box testing, which makes it difficult to automate synchronization events or reproduce escalations seen in production. White-box testing addresses this challenge by enabling precise fault injection and deterministic debugging.

Early efforts used gdb to manipulate the threads of Java processes, which successfully reproduced a silent data corruption issue. However, gdb sessions were difficult to automate. This motivated the adoption of Byteman, a Java agent that enables runtime fault injection and controlled manipulation of execution flow.

By integrating Byteman into Ozone test suites, we can:

  • Systematically test synchronization paths.
  • Improve failure recovery validation.
  • Proactively detect hidden issues before they impact production.

Introduction to Byteman

Byteman is a powerful tool that allows developers and testers to inject custom behavior into Java applications at runtime. It requires no recompilation or code changes, making it ideal for fault injection, testing, and debugging.

Key Features

  • Runtime Code Injection — dynamically alter Java methods.
  • Rule-Based Fault Injection — introduce deterministic faults.
  • JVM Compatibility — works across JVM-based services, including containerized environments.
  • Testing & Debugging — validate resilience by injecting controlled failures.

Benefits for Ozone

  • Simulate real-world failures without custom builds or patches.
  • Improve code coverage by introducing edge cases dynamically.
  • Verify error-handling logic under controlled fault scenarios.
  • Optimize performance testing by stressing internal functions.

Byteman Rule Structure

A Byteman rule is the basic unit of fault injection. Each rule defines when, where, and what to inject at runtime. The general syntax looks like this:

RULE <rule name>
CLASS <class name>
METHOD <method name>
AT <location>
IF <condition>
DO <actions>
ENDRULE

Breakdown of Components:

  • RULE - A descriptive name for the fault scenario.
  • CLASS - Fully qualified Java class where the fault is injected.
  • METHOD - Method to intercept (constructors can also be targeted with <init>).
  • AT - Injection point (e.g., ENTRY, EXIT, or specific line numbers).
  • IF - Boolean condition to decide when the rule triggers (can use variables, method arguments, or always TRUE).
  • DO - The code snippet to execute, such as throwing exceptions, changing return values, or logging.
  • ENDRULE - Marks the end of the rule.
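
To make the mapping from components to rule text concrete, here is a minimal Python sketch that assembles a rule from its parts. The helper `render_rule` and its parameter names are hypothetical and are not part of Byteman or Ozone; the sketch only mirrors the structure described above.

```python
def render_rule(name, target_class, method, actions,
                location="ENTRY", condition="TRUE"):
    """Assemble Byteman rule text from its components (hypothetical helper)."""
    lines = [
        f"RULE {name}",
        f"CLASS {target_class}",
        f"METHOD {method}",
        f"AT {location}",
        f"IF {condition}",
        "DO",
    ]
    # Multiple actions are separated by semicolons.
    lines.append(";\n".join(f"  {action}" for action in actions))
    lines.append("ENDRULE")
    return "\n".join(lines) + "\n"


rule = render_rule(
    name="Trace putBlock",
    target_class="org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl",
    method="putBlock",
    actions=['traceln("BYTEMAN: entering putBlock")'],
)
print(rule)
```

Generating rules this way keeps the clause order fixed, so a test author only has to supply the target and the actions.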

Example Rule Definitions

"SkipPutBlock": textwrap.dedent("""\
RULE Block putBlock
CLASS org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl
METHOD putBlock
AT ENTRY
IF TRUE
DO
System.out.println("[" + java.time.LocalDateTime.now() + "] BYTEMAN: Blocking putBlock in BlockManagerImpl");
return 0;
ENDRULE
"""),

Because the rule returns at method entry, the body of putBlock never runs. Rules like this let us skip execution paths in Ozone internals, helping reproduce and validate tricky failure scenarios.
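
The fragment above suggests that rules are kept in a Python dictionary keyed by name. As a rough sketch of how such a registry could be turned into a .btm file on disk, consider the following. The names `RULES` and `write_rule_file` are assumptions made for illustration, and the log message is shortened.

```python
import tempfile
import textwrap
from pathlib import Path

# Hypothetical registry of named rules, modeled on the fragment above.
RULES = {
    "SkipPutBlock": textwrap.dedent("""\
RULE Block putBlock
CLASS org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl
METHOD putBlock
AT ENTRY
IF TRUE
DO
System.out.println("BYTEMAN: Blocking putBlock in BlockManagerImpl");
return 0;
ENDRULE
"""),
}


def write_rule_file(rule_name, directory):
    """Write one named rule to <directory>/<rule_name>.btm and return its path."""
    path = Path(directory) / f"{rule_name}.btm"
    path.write_text(RULES[rule_name], encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    rule_path = write_rule_file("SkipPutBlock", tmp)
    print(rule_path.name)
    print(rule_path.read_text(encoding="utf-8").splitlines()[0])
```

Keeping one rule per file makes it straightforward to load and unload a single fault without touching the others.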

Byteman Integration in Ozone Acceptance Tests

As part of HDDS-13251 and the associated Apache Ozone PR #8783, Byteman was integrated directly into acceptance tests.

For example, a new Robot test suite, container-state-verifier.robot, was added that:

  • Injects Byteman rules to override ContainerData.getState().
  • Runs ozone debug replicas verify --container-state.
  • Asserts expected state transitions (UNHEALTHY, DELETED, INVALID).
  • Cleans up the rules after the test.

This demonstrates how Byteman fault injection can be combined with CLI-based checks for systematic validation.
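
The inject, run, assert, clean up sequence can be sketched in Python as a context manager that guarantees the rule is unloaded even if the check fails. This is an illustration only: it assumes rules are submitted with Byteman's bmsubmit.sh script (-l to load, -u to unload) and that the agent listens on port 9091, which may not match how the Ozone acceptance tests actually submit rules. The host name, rule path, and helper names are hypothetical.

```python
from contextlib import contextmanager


def bmsubmit_command(action, rule_file, host, port=9091):
    """Build a bmsubmit command line; action is 'load' or 'unload'."""
    flag = {"load": "-l", "unload": "-u"}[action]
    return ["bmsubmit.sh", "-h", host, "-p", str(port), flag, rule_file]


@contextmanager
def injected_rule(rule_file, host, run):
    """Load a rule, yield to the test body, and always unload afterwards."""
    run(bmsubmit_command("load", rule_file, host))
    try:
        yield
    finally:
        run(bmsubmit_command("unload", rule_file, host))


# Dry run: record the commands instead of executing them.
issued = []
with injected_rule("/tmp/container-state.btm", "datanode1", issued.append):
    issued.append(["ozone", "debug", "replicas", "verify", "--container-state"])

for command in issued:
    print(" ".join(command))
```

Passing `run` as a parameter keeps the sketch testable. In a real harness it could be a wrapper around `subprocess.run`.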

Adding a New Acceptance Test with Byteman

If you want to add your own acceptance test with Byteman, follow this pattern:

1. Create a Byteman Rule Template

dev-support/byteman/myfault-template.btm

RULE Override Method Behavior
CLASS com.mycompany.package.MyClass
METHOD myMethod
AT ENTRY
IF TRUE
DO
traceln("BYTEMAN RULE: Overriding myMethod() to return custom value");
return "FAULT_MODE"
ENDRULE

2. Write a Robot Test File

hadoop-ozone/dist/src/main/smoketest/myfault/myfault-verifier.robot

*** Variables ***
${TEMPLATE_RULE}    /opt/hadoop/share/ozone/byteman/myfault-template.btm

*** Keywords ***
Verify Behavior With Fault
    [Arguments]    ${expected}
    Add Byteman Rule    ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}
    ${output} =    Execute My Ozone CLI Command
    Should Contain    ${output}    ${expected}
    Remove Byteman Rule    ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}

*** Test Cases ***
Verify Custom Fault Mode
    Verify Behavior With Fault    TEST_VALUE

3. Hook Into the Test Runner

Modify the runner script (compose/common/.sh):

execute_robot_test ${OM} \
    -v "PREFIX:${prefix}" \
    -v "DATANODE:${host}" \
    -v "FAULT_INJ_DATANODE:${datanode}" \
    smoketest/myfault/myfault-verifier.robot

That's it! Your new test will:

  • Inject custom runtime faults.
  • Validate Ozone's handling of failure paths.
  • Run seamlessly in CI pipelines.

Case Study Example

One real-world bug caught through Byteman was a "replica not found" scenario. By forcing inconsistent container states, the acceptance test reproduced the failure and validated the recovery paths.

  • Root Cause: Replica state not reconciled after volume failure.
  • Impact: Risk of data loss or silent corruption.
  • Resolution: Byteman tests exposed the race condition, enabling targeted fixes before production impact.

Challenges with Byteman

  • Requires deeper knowledge of internal code flows.
  • Rules are text-based and not compile-checked with source changes.
  • Close collaboration between test engineers and developers is critical to maintain accuracy.
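
Because rules are plain text, even a lightweight structural check can catch some mistakes before a rule reaches a running JVM. The following Python sketch is a heuristic that only verifies the clause layout described in this article. It is not a Byteman parser and does not confirm that the target class or method still exists; `lint_rule` and `REQUIRED_CLAUSES` are hypothetical names.

```python
import re

# Clauses this article's rule template expects, in order. AT is left out
# because the check only covers clauses treated here as mandatory.
REQUIRED_CLAUSES = ["RULE", "CLASS", "METHOD", "IF", "DO", "ENDRULE"]


def lint_rule(text):
    """Return a list of structural problems found in one rule's text."""
    keywords = []
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z]+)\b", line)
        if match and match.group(1) in REQUIRED_CLAUSES:
            keywords.append(match.group(1))
    problems = [f"missing {clause} clause"
                for clause in REQUIRED_CLAUSES if clause not in keywords]
    if not problems and keywords != REQUIRED_CLAUSES:
        problems.append("clauses duplicated or out of order")
    return problems


good = """RULE Override Method Behavior
CLASS com.mycompany.package.MyClass
METHOD myMethod
AT ENTRY
IF TRUE
DO
return "FAULT_MODE"
ENDRULE
"""
bad = good.replace("ENDRULE\n", "")

print(lint_rule(good))  # []
print(lint_rule(bad))   # ['missing ENDRULE clause']
```

A check like this complements, rather than replaces, review by developers who know the targeted code paths.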

Conclusion

Byteman-powered white-box testing has significantly improved Ozone's resilience validation. With automated rule injection, acceptance test coverage now includes synchronization failures, replica mismatches, and recovery logic — all within CI/CD pipelines.

As Ozone evolves, this framework ensures that hidden issues are caught earlier, customer escalations are reduced, and the system continues to meet strict reliability guarantees.

Call for Contributions

Fault injection in Ozone is a community-driven effort. We encourage contributors to:

  • Add new Byteman rules targeting untested failure scenarios.
  • Write acceptance tests that validate recovery paths.
  • Propose tooling enhancements for easier rule management and visualization.
  • Share real-world failure cases that can inspire test coverage.

Upstream Tracker: labels=ozone-fi

If you're interested in strengthening Ozone's reliability story, jump into the Apache Ozone JIRA board or explore open PRs on GitHub. Contributions, big or small, will directly improve the system's robustness and help the entire community.

References

  1. Byteman-rule-language
  2. Byteman in upstream: ozone/pull/8654
  3. Reference PRs: ozone/pull/8783 | ozone/pull/8810

Ozone White-Box Testing with Byteman: A Deep Dive into Fault Injection was originally published in Engineering@Cloudera on Medium.