Created on 10-27-2025 11:51 PM - last edited on 12-18-2025 08:19 AM by VidyaSargur
Contributors: Soumitra, Swami, Rishabh, Uma, Arpit
White-box testing is essential for distributed file systems like Apache Ozone, ensuring data consistency, replication, and fault tolerance by exposing race conditions, concurrency issues, and failure scenarios. Unlike black-box testing, which focuses on system behavior without internal insights, white-box testing provides a deeper understanding of internal mechanisms, helping detect and resolve complex synchronization issues that are otherwise hard to reproduce.
Most downstream Ozone integration tests rely on black-box testing, which makes it difficult to automate synchronization events or reproduce escalations seen in production. White-box testing addresses this challenge by enabling precise fault injection and deterministic debugging.
Early efforts used gdb to manipulate threads in the Java process and successfully reproduced a silent data corruption issue. However, the gdb-based approach was difficult to automate. This motivated the adoption of Byteman, a Java agent that enables runtime fault injection and controlled manipulation of execution flow.
By integrating Byteman into Ozone test suites, we can:
Byteman is a powerful tool that allows developers and testers to inject custom behavior into Java applications at runtime. It requires no recompilation or code changes, making it ideal for fault injection, testing, and debugging.
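Because Byteman runs as a Java agent, it is typically attached to a JVM through a `-javaagent` option. The following is a minimal sketch of how such an option could be assembled. The jar path is hypothetical, and how the Ozone compose environment actually passes this option to its services is not shown in this article.

```python
def byteman_agent_option(jar_path, port=9091, scripts=()):
    """Build a -javaagent JVM option that attaches the Byteman agent at startup."""
    # listener:true starts the agent listener so rules can be loaded and
    # unloaded later while the process is running.
    options = ["listener:true", f"port:{port}"]
    # Each script:<path> entry loads a rule file when the JVM starts.
    options += [f"script:{path}" for path in scripts]
    return f"-javaagent:{jar_path}=" + ",".join(options)


# Hypothetical install location, for illustration only.
print(byteman_agent_option("/opt/byteman/lib/byteman.jar"))
```

With the listener enabled, rules do not have to be known at startup, which is what makes it possible to add and remove faults from a test while the service keeps running.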
A Byteman rule is the basic unit of fault injection. Each rule defines when, where, and what to inject at runtime. The general syntax looks like this:
RULE <rule name>
CLASS <class name>
METHOD <method name>
AT <location>
IF <condition>
DO <actions>
ENDRULE
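The rule structure above can also be assembled programmatically, which may be convenient when a test suite keeps many similar rules. This is only a sketch: `build_rule` and the class and method names in the example are hypothetical and are not part of Byteman or Ozone.

```python
def build_rule(name, target_class, method, actions,
               location="AT ENTRY", condition="TRUE"):
    """Return Byteman rule text for a single injection point."""
    header = [
        f"RULE {name}",
        f"CLASS {target_class}",
        f"METHOD {method}",
        location,
        f"IF {condition}",
        "DO",
    ]
    # Byteman separates consecutive actions with semicolons.
    body = ";\n".join(f"    {action}" for action in actions)
    return "\n".join(header + [body, "ENDRULE"]) + "\n"


print(build_rule(
    name="Trace example call",
    target_class="com.example.Example",  # hypothetical class
    method="doWork",                     # hypothetical method
    actions=['traceln("BYTEMAN: entered doWork")'],
))
```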
"SkipPutBlock": textwrap.dedent("""\
    RULE Block putBlock
    CLASS org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl
    METHOD putBlock
    AT ENTRY
    IF TRUE
    DO
        System.out.println("[" + java.time.LocalDateTime.now() + "] BYTEMAN: Blocking putBlock in BlockManagerImpl");
        return 0;
    ENDRULE
    """),
These rules allow us to skip execution paths in Ozone internals, helping reproduce and validate tricky failure scenarios.
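A rule table like the one above has to reach the Byteman agent as a rule script. The sketch below writes one entry to a `.btm` file. The `RULES` dictionary name and the `write_rule` helper are assumptions made for illustration; the article does not show how the original test harness stores or submits its rules.

```python
import tempfile
import textwrap
from pathlib import Path

# Hypothetical rule table, mirroring the "SkipPutBlock" entry shown above.
RULES = {
    "SkipPutBlock": textwrap.dedent("""\
        RULE Block putBlock
        CLASS org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl
        METHOD putBlock
        AT ENTRY
        IF TRUE
        DO
            System.out.println("[" + java.time.LocalDateTime.now() + "] BYTEMAN: Blocking putBlock in BlockManagerImpl");
            return 0;
        ENDRULE
        """),
}


def write_rule(name, directory):
    """Write the named rule to <directory>/<name>.btm and return the path."""
    path = Path(directory) / f"{name}.btm"
    path.write_text(RULES[name])
    return path


with tempfile.TemporaryDirectory() as tmp:
    rule_file = write_rule("SkipPutBlock", tmp)
    print(rule_file.read_text())
```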
As part of HDDS-13251 and the associated Apache Ozone PR #8783, Byteman was integrated directly into acceptance tests.
For example, a new Robot test suite container-state-verifier.robot was added that:
This demonstrates how Byteman fault injection can be combined with CLI-based checks for systematic validation.
If you want to add your own acceptance test with Byteman, follow this pattern:
dev-support/byteman/myfault-template.btm
RULE Override Method Behavior
CLASS com.mycompany.package.MyClass
METHOD myMethod
AT ENTRY
IF TRUE
DO
traceln("BYTEMAN RULE: Overriding myMethod() to return custom value");
return "FAULT_MODE"
ENDRULE
hadoop-ozone/dist/src/main/smoketest/myfault/myfault-verifier.robot
*** Variables ***
${TEMPLATE_RULE}    /opt/hadoop/share/ozone/byteman/myfault-template.btm

*** Keywords ***
Verify Behavior With Fault
    # The expected text is passed in by the test case.
    [Arguments]    ${expected}
    Add Byteman Rule    ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}
    ${output} =    Execute My Ozone CLI Command
    Should Contain    ${output}    ${expected}
    Remove Byteman Rule    ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}

*** Test Cases ***
Verify Custom Fault Mode
    Verify Behavior With Fault    TEST_VALUE
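The keyword above follows an add, check, remove sequence. In a sequence like this, a failed check can leave the rule loaded unless removal is placed in a teardown. The sketch below shows the same pattern in Python with guaranteed cleanup. It assumes Byteman's `bmsubmit.sh` client is on the PATH and that the agent listener is reachable on the given host and port; `injected_fault` is a hypothetical helper, and this is not how the Ozone Robot keywords are implemented.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def injected_fault(rule_path, host="localhost", port=9091, run=subprocess.run):
    """Load a Byteman rule for the duration of a with-block, then unload it."""
    base = ["bmsubmit.sh", "-h", host, "-p", str(port)]
    run(base + ["-l", rule_path], check=True)  # load the rule into the agent
    try:
        yield
    finally:
        # Unload even when the check inside the block fails, so the fault
        # does not leak into later tests.
        run(base + ["-u", rule_path], check=True)


# Demonstration with a recording stand-in, so no Byteman agent is required.
calls = []


def fake_run(cmd, check=True):
    calls.append(cmd)


with injected_fault("/tmp/myfault-template.btm", host="datanode1", run=fake_run):
    pass  # run the CLI command and assert on its output here

for cmd in calls:
    print(" ".join(cmd))
```

Passing the runner as a parameter keeps the helper testable without a live cluster.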
Modify the runner script (compose/common/.sh):
execute_robot_test ${OM} \
    -v "PREFIX:${prefix}" \
    -v "DATANODE:${host}" \
    -v "FAULT_INJ_DATANODE:${datanode}" \
    smoketest/myfault/myfault-verifier.robot
That's it! Your new test will:
One real-world bug caught through Byteman was a "replica not found" scenario. By forcing inconsistent container states, the acceptance test reproduced the failure and validated the recovery paths.
Byteman-powered white-box testing has significantly improved Ozone's resilience validation. With automated rule injection, acceptance test coverage now includes synchronization failures, replica mismatches, and recovery logic — all within CI/CD pipelines.
As Ozone evolves, this framework ensures that hidden issues are caught earlier, customer escalations are reduced, and the system continues to meet strict reliability guarantees.
Fault injection in Ozone is a community-driven effort. We encourage contributors to:
Upstream Tracker: labels=ozone-fi
If you're interested in strengthening Ozone's reliability story, jump into the Apache Ozone JIRA board or explore open PRs on GitHub. Contributions, big or small, will directly improve the system's robustness and help the entire community.
Ozone White-Box Testing with Byteman: A Deep Dive into Fault Injection was originally published in Engineering@Cloudera on Medium.