Member since: 06-26-2018
Posts: 28
Kudos Received: 2
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4300 | 10-22-2019 09:24 AM |
| | 3104 | 10-29-2018 02:28 PM |
| | 13087 | 10-08-2018 08:36 AM |
10-27-2025
11:51 PM
Contributors: Soumitra, Swami, Rishabh, Uma, Arpit
Background
White-box testing is essential for distributed file systems like Apache Ozone: by exposing race conditions, concurrency issues, and failure scenarios, it helps ensure data consistency, replication, and fault tolerance. Unlike black-box testing, which exercises system behavior without internal insight, white-box testing provides a deeper understanding of internal mechanisms, helping detect and resolve complex synchronization issues that are otherwise hard to reproduce.
Why White-Box Testing for Ozone?
Most downstream Ozone integration tests rely on black-box testing, which makes it difficult to automate synchronization events or reproduce escalations seen in production. White-box testing addresses this challenge by enabling precise fault injection and deterministic debugging.
Early efforts used gdb to manipulate Java process threads and successfully reproduced a silent data corruption issue. However, gdb-based manipulation was hard to automate. This motivated the adoption of Byteman, a Java agent that enables runtime fault injection and controlled manipulation of execution flow.
By integrating Byteman into Ozone test suites, we can:
Systematically test synchronization paths.
Improve failure recovery validation.
Proactively detect hidden issues before they impact production.
Introduction to Byteman
Byteman is a powerful tool that allows developers and testers to inject custom behavior into Java applications at runtime. It requires no recompilation or code changes, making it ideal for fault injection, testing, and debugging.
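For context, attaching Byteman to a JVM is lightweight. Below is a minimal sketch using the scripts bundled with the Byteman download; the target pid, the paths, and the application jar are placeholders, not the exact commands used by the Ozone test harness.

# Attach the Byteman agent to an already-running JVM (e.g. an Ozone Datanode); <pid> is a placeholder.
${BYTEMAN_HOME}/bin/bminstall.sh <pid>

# Or start the JVM with the agent and a rule script loaded from the beginning.
java -javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:/path/to/rules.btm,listener:true \
     -jar my-app.jar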
Key Features
Runtime Code Injection — dynamically alter Java methods.
Rule-Based Fault Injection — introduce deterministic faults.
JVM Compatibility — works across JVM-based services, including containerized environments.
Testing & Debugging — validate resilience by injecting controlled failures.
Benefits for Ozone
Simulate real-world failures without custom builds or patches.
Improve code coverage by introducing edge cases dynamically.
Verify error-handling logic under controlled fault scenarios.
Optimize performance testing by stressing internal functions.
Byteman Rule Structure
A Byteman rule is the basic unit of fault injection. Each rule defines when, where, and what to inject at runtime. The general syntax looks like this:
RULE <rule name>
CLASS <fully qualified class name>
METHOD <method name>
AT <injection point>
IF <condition>
DO <actions>
ENDRULE
Breakdown of Components:
RULE - A descriptive name for the fault scenario.
CLASS - Fully qualified Java class where the fault is injected.
METHOD - Method to intercept (constructors can also be targeted with <init>).
AT - Injection point (e.g., ENTRY, EXIT, or specific line numbers).
IF - Boolean condition to decide when the rule triggers (can use variables, method arguments, or always TRUE).
DO - The code snippet to execute, such as throwing exceptions, changing return values, or logging.
ENDRULE - Marks the end of the rule.
Example Rule Definitions
"SkipPutBlock": textwrap.dedent("""\\ RULE Block putBlock CLASS org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl METHOD putBlock AT ENTRY IF TRUE DO System.out.println("[" + java.time.LocalDateTime.now() + "] BYTEMAN: Blocking putBlock in BlockManagerImpl"); return 0; ENDRULE """),
These rules allow us to skip execution paths in Ozone internals, helping reproduce and validate tricky failure scenarios.
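Outside the test harness, a rule like this can be loaded into, and later removed from, an instrumented JVM with the bmsubmit.sh helper that ships with Byteman. The rule file name below is a placeholder.

# Submit the rule script to the Byteman agent listener running in the target JVM.
bmsubmit.sh skip-putblock.btm

# With no arguments, list the rules currently installed.
bmsubmit.sh

# Unload the rule once the scenario has been exercised.
bmsubmit.sh -u skip-putblock.btm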
Byteman Integration in Ozone Acceptance Tests
As part of HDDS-13251 and the associated Apache Ozone PR #8783, Byteman was integrated directly into acceptance tests.
For example, a new Robot test suite container-state-verifier.robot was added that:
Injects Byteman rules to override ContainerData.getState().
Runs ozone debug replicas verify --container-state.
Asserts expected state transitions (UNHEALTHY, DELETED, INVALID).
Cleans up the rules after the test.
This demonstrates how Byteman fault injection can be combined with CLI-based checks for systematic validation.
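For illustration, a state-override rule along those lines could look roughly like the sketch below. The fully qualified class name and the enum constant are assumptions made for the example; the authoritative rule template ships with the PR referenced above.

RULE Force container state to UNHEALTHY
# The target class and the returned enum below are illustrative assumptions;
# verify both against the actual template added in PR #8783.
CLASS org.apache.hadoop.ozone.container.common.impl.ContainerData
METHOD getState
AT ENTRY
IF TRUE
DO traceln("BYTEMAN: overriding ContainerData.getState()");
   return org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos$ContainerDataProto$State.UNHEALTHY
ENDRULE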
Adding a New Acceptance Test with Byteman
If you want to add your own acceptance test with Byteman, follow this pattern:
1. Create a Byteman Rule Template
dev-support/byteman/myfault-template.btm
RULE Override Method Behavior
CLASS com.mycompany.package.MyClass
METHOD myMethod
AT ENTRY
IF TRUE
DO traceln("BYTEMAN RULE: Overriding myMethod() to return custom value");
   return "FAULT_MODE"
ENDRULE
2. Write a Robot Test File
hadoop-ozone/dist/src/main/smoketest/myfault/myfault-verifier.robot
*** Variables ***
${TEMPLATE_RULE}    /opt/hadoop/share/ozone/byteman/myfault-template.btm
*** Keywords ***
Verify Behavior With Fault
    Add Byteman Rule       ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}
    ${output} =            Execute                  My Ozone CLI Command
    Should Contain         ${output}                EXPECTED_OUTPUT
    Remove Byteman Rule    ${FAULT_INJ_DATANODE}    ${TEMPLATE_RULE}
*** Test Cases ***
Verify Custom Fault Mode
    Verify Behavior With Fault    TEST_VALUE
3. Hook Into the Test Runner
Modify the runner script (compose/common/.sh):
execute_robot_test ${OM} \
    -v "PREFIX:${prefix}" \
    -v "DATANODE:${host}" \
    -v "FAULT_INJ_DATANODE:${datanode}" \
    smoketest/myfault/myfault-verifier.robot
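To try the suite locally, the compose-based acceptance tests can typically be driven through the test.sh helper of the corresponding compose environment; the directory below is an assumption, so pick whichever environment your runner script belongs to.

cd hadoop-ozone/dist/target/ozone-*/compose/ozone
./test.sh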
That's it! Your new test will:
Inject custom runtime faults.
Validate Ozone's handling of failure paths.
Run seamlessly in CI pipelines.
Case Study Example
One real-world bug caught through Byteman was a "replica not found" scenario. By forcing inconsistent container states, the acceptance test reproduced the failure and validated the recovery paths.
Root Cause: Replica state not reconciled after volume failure.
Impact: Risk of data loss or silent corruption.
Resolution: Byteman tests exposed the race condition, enabling targeted fixes before production impact.
Challenges with Byteman
Requires deeper knowledge of internal code flows.
Rules are text-based and not compile-checked, so they can silently drift out of sync with source changes.
Close collaboration between test engineers and developers is critical to maintain accuracy.
Conclusion
Byteman-powered white-box testing has significantly improved Ozone's resilience validation. With automated rule injection, acceptance test coverage now includes synchronization failures, replica mismatches, and recovery logic — all within CI/CD pipelines.
As Ozone evolves, this framework ensures that hidden issues are caught earlier, customer escalations are reduced, and the system continues to meet strict reliability guarantees.
Call for Contributions
Fault injection in Ozone is a community-driven effort. We encourage contributors to:
Add new Byteman rules targeting untested failure scenarios.
Write acceptance tests that validate recovery paths.
Propose tooling enhancements for easier rule management and visualization.
Share real-world failure cases that can inspire test coverage.
Upstream Tracker: labels=ozone-fi
If you're interested in strengthening Ozone's reliability story, jump into the Apache Ozone JIRA board or explore open PRs on GitHub. Contributions, big or small, will directly improve the system's robustness and help the entire community.
Glossary
Byteman-rule-language
Byteman in upstream: ozone/pull/8654
Reference PRs: ozone/pull/8783 | ozone/pull/8810
Ozone White-Box Testing with Byteman: A Deep Dive into Fault Injection was originally published in Engineering@Cloudera on Medium.
09-28-2020
11:49 AM
1 Kudo
Zookeeper does not allow listing or editing znodes if the current ACL does not grant the required permissions to the user or group. This znode security behavior is inherited from Apache Zookeeper and applies to all Cloudera distributions. A few references describe the workaround; this post compiles them together for Cloudera Manager managed clusters.
For the following error:
Authentication is not valid
There are two ways to address them:
Disable any ACL validation in Zookeeper (Not recommended):
Add the following config in CM > Zookeeper config > Search for 'Java Configuration Options for Zookeeper Server': -Dzookeeper.skipACL=yes
Then Restart and refresh the stale configs.
Add a Zookeeper super auth:
Skip the part wrapped in <SKIP> ... </SKIP> if you want to use 'password' as the auth key; the digest for it is already shown below.

<SKIP>
cd /opt/cloudera/parcels/CDH/lib/zookeeper/
java -cp "./zookeeper.jar:lib/*" org.apache.zookeeper.server.auth.DigestAuthenticationProvider super:password

Use the last line of the output from the above command:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
super:password->super:DyNYQEQvajljsxlhf5uS4PJ9R28=
</SKIP>
Add the following config in CM > Zookeeper config > Search 'Java Configuration Options for Zookeeper Server': -Dzookeeper.DigestAuthenticationProvider.superDigest=super:DyNYQEQvajljsxlhf5uS4PJ9R28=
Restart and refresh the stale configs.
Once connected to zookeeper-client, add the following command before executing any further command: addauth digest super:password
You will be able to run any operation on any znode post this command.
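For example, a typical zookeeper-client session after the super digest is configured might look like the following. The host name and the /hbase znode are just examples, and the setAcl shown opens the znode to everyone, so treat it purely as an illustration.

# Connect with the Zookeeper client (host and port are examples).
zookeeper-client -server zk-host.example.com:2181

# Inside the client shell, authenticate as super, then operate on any znode.
addauth digest super:password
ls /
getAcl /hbase
setAcl /hbase world:anyone:cdrwa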
NOTE:
The version of slf4j-api may differ on later builds.
Update the super password ('password' in the examples above) to any string you desire.
10-23-2019
05:33 AM
Thanks for the reply @rohitmalhotra. The user limit for yarn is set to 65536. Is there a recommended maximum value, or shall I just make it unlimited? (Can that have consequences?) Edit: I tried setting it to unlimited and am still seeing the same error.
10-22-2019
08:11 PM
I would suggest going through the docs below and verifying the outbound rules on port 7180. https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html
10-22-2019
12:05 PM
Good news. If that resolves your issue, please take a moment to accept the solution. Thanks.
10-22-2019
10:44 AM
Seeing the below exception when running the Hive TPC-DS data generator (https://github.com/hortonworks/hive-testbench) at a scale of ~500 GB.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
Attached log for complete stacktrace.
Cluster Configuration :
16 Nodes / 12 Nodemanagers / 12 Datanodes
Per Node Config :
Cores : 40
Memory : 392GB
Ambari configs changed from the initial defaults to improve performance:
Decided to set 10 GB as the container size to utilise the maximum cores per node (320 GB / 10 GB = 32 containers per node at 1 core each, hence ~32 cores/node utilised).
YARN
yarn.nodemanager.resource.memory-mb = 329216 MB
yarn.scheduler.minimum-allocation-mb = 10240 MB
yarn.scheduler.maximum-allocation-mb = 329216 MB
MapReduce (All Heap Sizes : -Xmx8192m : 80% of container)
mapreduce.map.memory.mb = 10240 MB
mapreduce.reduce.memory.mb = 10240 MB
mapreduce.task.io.sort.mb = 1792 MB
yarn.app.mapreduce.am.resource.mb = 10240 MB
Hive
hive.tez.container.size = 10240MB
hive.auto.convert.join.noconditionaltask.size = 2027316838 B
hive.exec.reducers.bytes.per.reducer = 1073217536 B
Tez
tez.am.resource.memory.mb = 10240 MB
tez.am.resource.java.opts = -server -Xmx8192m
tez.task.resource.memory.mb = 10240 MB
tez.runtime.io.sort.mb = 2047 MB (~20% of container)
tez.runtime.unordered.output.buffer.size-mb = 768 MB (~10% of container)
tez.grouping.max-size = 2073741824 B
tez.grouping.min-size = 167772160 B
Any help would be greatly appreciated. I referred to https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 for some of the tuning values.
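For reference, the 'unable to create new native thread' error usually points at an OS-level process/thread limit rather than JVM heap. A few quick checks on the NodeManager hosts (assuming the containers run as the yarn user; adjust the user as needed):

su - yarn -s /bin/bash -c 'ulimit -u'   # max user processes (threads) for the yarn user
cat /proc/sys/kernel/threads-max        # system-wide thread limit
cat /proc/sys/kernel/pid_max            # system-wide pid limit
ps -eLf | awk '$1 == "yarn"' | wc -l    # threads currently owned by the yarn user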
Labels:
- Apache Ambari
- Apache Hive
- Apache Tez
10-22-2019
09:38 AM
A regular exception is observed in the CM server logs:

2019-10-20 17:32:34,687 ERROR ParcelUpdateService:com.cloudera.parcel.components.ParcelDownloaderImpl: (11 skipped) Unable to retrieve remote parcel repository manifest
java.util.concurrent.ExecutionException: java.net.ConnectException: connection timed out: archive.cloudera.com/151.101.188.167:443

This may happen if you need an http_proxy to access the public web or you are on a private network. CM is trying to reach the archive URL to download parcels (the method used while installing CM) and is failing to do so. Try running the below command on the CM node and let us know the output:

wget https://archive.cloudera.com/cdh6/6.3.1/parcels/manifest.json

If you want to set a proxy, it can be done under Administration > Search for 'Proxy'.
10-22-2019
09:24 AM
Can you verify proper ownership of the cloudera-scm-server-db folder by running the below commands:

chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-scm-server-db/
chmod 700 /var/lib/cloudera-scm-server-db/
chmod 700 /var/lib/cloudera-scm-server-db/data
service cloudera-scm-server-db start

Also verify the SELinux status by running: sestatus
02-28-2019
11:51 AM
3 Kudos
Hi,
If you don't want SMARTSENSE in your cluster but it still comes up as a default selected component during the install wizard, go through the below steps to save yourself some trouble. Tried on HDP versions 3.0 and 3.1.
1. Go to the below path on the ambari-server node: /var/lib/ambari-server/resources/stacks/HDP/3.0/services/SMARTSENSE/metainfo.xml
2. Open the above file in an editor (e.g. vi).
3. Comment out or delete the below line (line 23 may vary in different releases): <selection>MANDATORY</selection>
4. After making the above change, restart ambari-server and proceed with the cluster install wizard.
Now SMARTSENSE won't be a mandatory component. Thanks for reading.
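If you prefer doing this from the command line, here is a one-liner sketch that comments the line out in place (assuming the default path above; take a backup first):

cp /var/lib/ambari-server/resources/stacks/HDP/3.0/services/SMARTSENSE/metainfo.xml{,.bak}
sed -i 's|<selection>MANDATORY</selection>|<!-- <selection>MANDATORY</selection> -->|' \
  /var/lib/ambari-server/resources/stacks/HDP/3.0/services/SMARTSENSE/metainfo.xml
ambari-server restart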
10-29-2018
04:21 PM
If it worked for you, please take a moment to log in and "Accept" the answer.