Member since: 11-27-2022
Posts: 5
Kudos Received: 1
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1871 | 01-25-2023 01:05 AM |
12-08-2025
08:07 AM
## The Storage Challenge Every Business Faces

Enterprises create massive volumes of data—the Global Datasphere reached 149 ZB in 2024 and is projected to grow to 181 ZB by 2025 and 394 ZB by 2028. Multiple industry sources report that ~60–80% of enterprise data becomes "cold" (rarely accessed within months to a year), yet it often remains on expensive primary storage. Meanwhile, many Hadoop-based data lakes still default to 3x replication.

Consider this: a company with 100TB of data actually consumes 300TB of storage due to 3x replication. What if you could automatically reduce that 300TB to under 200TB? In this blog, we'll walk through the calculations to show exactly how Cloudera's Ozone Storage Optimizer makes this possible—an intelligent system that converts cold data from expensive replication to storage-efficient Erasure Coding, reducing storage overhead from 200% to just 40–50% (≈50–53% storage reduction vs. 3x replication).

## Introducing Ozone Storage Optimizer

Ozone Storage Optimizer is an automated data lifecycle management feature for Apache Ozone that identifies and converts infrequently accessed data to reduce storage usage. The system continuously analyzes access patterns, applies configurable policies, and seamlessly transitions cold data from 3x replicated storage to Erasure Coding format.

Erasure Coding (EC) is not limited to cold data; it can also be used directly for hot data wherever applicable, and it performs best under heavy sequential access patterns. In some cases, users make this choice based on their own workload benchmarks. With 3-way replication, however, you always incur 200% overhead. Once data has been identified as cold, it makes sense to convert it to a storage-efficient format: reads remain comparably fast in EC format while you achieve significant storage savings.

Learn more about Apache Ozone's architecture and how it handles multi-protocol storage.

## How It Reduces Your Physical Storage Needs

Different storage formats have different storage overhead:

- Standard 3x Replication: uses 3TB of storage for every 1TB of data (200% overhead)
- Erasure Coding (EC): uses only 1.4-1.5TB for every 1TB of data (40-50% overhead)

Storage Optimizer automatically moves your cold data from expensive replication to storage-efficient Erasure Coding—while maintaining the same data durability and security.
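For context on how EC is expressed in Ozone itself, here is a minimal sketch of creating a bucket with an EC replication config instead of 3x replication. The volume and bucket names are placeholders, and the RS-6-3 codec choice is illustrative; verify the flags against your Ozone release.

```bash
# Create a bucket whose keys default to Erasure Coding (Reed-Solomon, 6 data + 3 parity)
# instead of 3x RATIS replication. Volume/bucket names are placeholders.
ozone sh volume create /archive
ozone sh bucket create --type EC --replication rs-6-3-1024k /archive/cold-data

# Inspect the bucket to confirm its replication configuration
ozone sh bucket info /archive/cold-data
```

With RS-6-3, every 6 data blocks carry 3 parity blocks, which is where the roughly 1.5x physical footprint in the comparison above comes from.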
## Key Benefits

- 50-60% storage reduction for cold data—immediate storage gains
- Fully automated—set it once and forget it
- Zero application changes—completely transparent to your systems
- Enterprise security—maintains all compliance and security controls
- No performance impact—doesn't affect your running workloads

## Real-World Impact: The Detailed Calculation

Let's walk through a practical example step by step. Consider an organization with 500TB of physical storage capacity.

**Before optimization:**

- Logical data: 100TB (your actual data)
- Physical storage consumed: 100TB × 3 (replication factor) = 300TB
- Remaining capacity: 500TB - 300TB = 200TB

**After optimization (assuming 70% of data becomes cold):**

Let's break down how the data is stored after optimization:

- Hot data (30TB logical): still uses 3x replication → physical storage: 30TB × 3 = 90TB
- Cold data (70TB logical): converted to Erasure Coding (1.5x overhead) → physical storage: 70TB × 1.5 = 105TB
- Total physical storage: 90TB (hot) + 105TB (cold) = 195TB consumed
- Storage freed: 300TB - 195TB = 105TB

**New storage capacity:** with improved storage efficiency (averaging 1.95x overhead instead of 3x):

- Before: 500TB ÷ 3.0 ≈ 166TB maximum logical capacity
- After: 500TB ÷ 1.95 ≈ 256TB maximum logical capacity
- Result: you can now store 90TB more logical data (a 54% capacity increase)

**The bottom line:** your organization freed up 105TB of physical storage, allowing you to store significantly more data without purchasing additional hardware—deferring hardware purchases by 2+ years.
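If you want to plug in your own numbers, the same arithmetic can be scripted. This is only the calculation from the example above, not a Cloudera tool; the 1.5x EC overhead and 70% cold fraction are assumptions you should replace with your own figures.

```bash
#!/usr/bin/env bash
# Rough capacity math: hot data stays on 3x replication, cold data moves to EC (~1.5x).
logical_tb=100      # logical data in TB
cold_pct=70         # share of data classified as cold (%)
ec_overhead=1.5     # physical bytes per logical byte under EC (e.g. RS-6-3)

hot_tb=$(echo "$logical_tb * (100 - $cold_pct) / 100" | bc -l)
cold_tb=$(echo "$logical_tb * $cold_pct / 100" | bc -l)
before=$(echo "$logical_tb * 3" | bc -l)
after=$(echo "$hot_tb * 3 + $cold_tb * $ec_overhead" | bc -l)

printf "Physical before: %.0f TB\nPhysical after:  %.0f TB\nFreed:           %.0f TB\n" \
  "$before" "$after" "$(echo "$before - $after" | bc -l)"
```

Running it with the blog's inputs reproduces the 300TB before, 195TB after, and 105TB freed from the example.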
## How It Works—The Simple Version

Storage Optimizer runs automatically every day through five intelligent stages:

1. Monitor Access Patterns: tracks which files are being used and which aren't
2. Collect Metadata: gathers information about your files (size, age, type)
3. Analyze Data: identifies which files are truly "cold" based on usage
4. Apply Policies: uses your configured rules (e.g., "convert files not accessed in 30 days")
5. Optimize Storage: automatically converts cold files to the efficient format

All of this happens in the background without impacting your applications or users.

## Getting Started—3 Simple Steps

**Step 1: Choose Your Policy**

Storage Optimizer provides a simple web interface to configure your optimization rules. Three common approaches:

- Balanced Approach (recommended for most organizations): convert files not accessed in 30 days; minimum file size: 32MB; best for general-purpose data lakes
- Aggressive Approach (maximum savings): convert files not accessed in 7 days; minimum file size: 1MB; best for log data and temporary files
- Conservative Approach (minimal risk): convert files not accessed in 90 days; minimum file size: 100MB; best for critical archived data

**Step 2: Set Exclusions (Optional)**

Protect specific data that should never be converted:

- Critical real-time analytics data
- Frequently accessed archives
- Compliance-protected data

**Step 3: Monitor Results**

Track your savings through the built-in dashboard: storage freed, data converted, and system health. You can also spot-check conversions directly from the Ozone shell, as shown below.
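As a sanity check beyond the dashboard, the Ozone shell can report the replication configuration of individual buckets and keys. A minimal sketch follows; the paths are placeholders, and the exact output fields may vary by Ozone version.

```bash
# Show bucket-level defaults, including the replication configuration
ozone sh bucket info /archive/cold-data

# Show a single key's details; a converted key reports an EC replication
# config (e.g. RS-6-3) instead of RATIS/THREE
ozone sh key info /archive/cold-data/logs/2024/app.log
```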
## Storage Optimizer Use Cases: Maximizing Value Across Ozone Deployments

Ozone serves as a scalable object store for diverse enterprise workloads—from traditional big data analytics to modern cloud-native applications. Storage Optimizer enhances every Ozone deployment by automatically reducing storage usage while maintaining full data accessibility. Here's how Storage Optimizer delivers value across common Ozone scenarios.

**Enterprise Data Lakes and Big Data Analytics**

The scenario: organizations use Ozone to store petabytes of data for data lakes, big data analytics, and IoT applications. Batch processing with Spark, Hive, and other Hadoop tools generates massive datasets—historical reports, archived logs, sensor data, and analytical results.

How Storage Optimizer helps: as analytics datasets age, 60-80% becomes cold but must remain accessible for compliance or ad-hoc queries. Storage Optimizer automatically identifies and converts this cold data from 3x replication to Erasure Coding.

Result: 45-60% storage reduction for cold data, enabling extended retention periods without capacity expansion. A 100TB analytics dataset (300TB physical with replication) can be reduced to ~195TB after optimization—freeing 105TB for new data.

**Unified Hadoop and Object Storage Workloads**

The scenario: Ozone provides dual-protocol support (ofs:// filesystem and S3-compatible API), serving as a unified layer for both Hadoop batch jobs and modern object-store applications. This eliminates the need for separate storage systems but means diverse data accumulates rapidly.

How Storage Optimizer helps: Storage Optimizer works seamlessly across both protocols. Whether data is written via Hadoop filesystem APIs or the S3 SDK, Storage Optimizer analyzes access patterns and converts cold objects to EC format—regardless of how they were created or accessed.

Result: 50-60% storage savings across the entire unified storage layer. Both Hadoop-generated datasets and S3-uploaded objects benefit from automatic optimization, with no application code changes required.

**Large-Scale HDFS Migration to Ozone**

The scenario: organizations migrate from HDFS to Ozone to overcome scalability limitations. Preferred Networks migrated to Ozone and scaled to 10 billion objects, leveraging Ozone's separated metadata architecture (Ozone Manager + Storage Container Manager) to eliminate HDFS NameNode bottlenecks.

How Storage Optimizer helps: post-migration, decades of accumulated HDFS data sits in Ozone—much of it cold. Manually identifying and converting cold data across billions of objects is impractical. Storage Optimizer automatically analyzes the entire migrated dataset and converts cold objects without human intervention.

Result: 35-50% capacity freed post-migration. An organization migrating 500TB of HDFS data (1.5PB physical with replication) can reduce physical consumption to ~1PB after Storage Optimizer processes cold data—deferring infrastructure expansion by 2+ years.

**Hybrid Cloud and Cloud-Bursting Architectures**

The scenario: Microsoft OneLake integration with Ozone enables organizations to virtualize their Cloudera/Hadoop data into Microsoft Fabric, supporting cloud-bursting where workloads dynamically shift between on-premises and cloud based on demand.

How Storage Optimizer helps: hybrid architectures create pressure on on-premises storage. Storage Optimizer reduces this pressure by converting cold on-premises data to EC format, lowering physical storage requirements by roughly 50% without moving data to the cloud (which incurs egress costs).

Result: delay expensive on-premises capacity upgrades while maintaining cloud flexibility. Organizations can keep more data on-premises in optimized format, reserving cloud bursting for compute rather than cold storage.

**Open Data Lakehouse with Iceberg**

The scenario: Ozone serves as the object store for an Open Data Lakehouse powered by Iceberg, combining data lake flexibility with data warehouse performance. Iceberg's time travel and ACID capabilities require maintaining extensive data histories and snapshots.

How Storage Optimizer helps: lakehouse time travel features generate numerous historical snapshots. Recent snapshots need fast access; older versions are rarely queried but must be retained. Storage Optimizer automatically converts historical snapshots and old table versions to EC format.

Result: 50-53% reduction in long-term archival storage. Organizations maintain complete data lineage and time travel capabilities without a storage penalty. A lakehouse with 50TB of current data plus 200TB of historical snapshots (600TB physical with replication) can be reduced to ~370TB after optimization.

**Log Aggregation and Compliance Retention**

The scenario: enterprises store application logs, audit trails, and system telemetry in Ozone for compliance, security analysis, and troubleshooting. Recent logs (the last 7-30 days) require fast access for debugging; older logs are rarely queried but must meet regulatory retention periods (often 1-7 years).

How Storage Optimizer helps: an aggressive optimization policy converts logs older than 7 days to EC format while keeping recent logs in fast-access replication. This matches the actual access pattern—frequent queries on recent logs, rare access to historical logs.

Result: 70% storage reduction for log data, enabling 3x longer retention within the same capacity—critical for meeting compliance requirements without purchasing additional storage. An organization retaining 1 year of logs at 10TB/month (360TB physical) can extend to 3 years of retention in ~400TB after optimization.

## Enterprise-Grade Security

Storage Optimizer maintains your existing security controls:

- Kerberos authentication for secure access
- TLS/SSL encryption for data in transit
- Ranger integration for access control
- Audit trails for compliance
- Checksum verification to ensure data integrity
- Zero data loss through proven technology

Your data remains just as secure after optimization as before—we just store it more efficiently.

## Why This Matters Now

As data continues to grow exponentially, storage usage becomes a larger portion of IT budgets. Traditional approaches—buying more storage or deleting old data—aren't sustainable. Storage Optimizer offers a third path: intelligent efficiency. Keep all your data accessible while dramatically reducing storage.

## The ROI Is Clear

- Setup time: 2-3 days of one-time configuration
- Ongoing effort: none (fully automated)
- Storage savings: 45-60% typical
- Payback period: immediate
- Business impact: defer hardware purchases by years

## Start Optimizing Today

Most organizations see immediate value after enabling Storage Optimizer:

- Weeks 1-2: start with a pilot—select one namespace, use conservative settings, monitor results
- Week 3+: expand coverage—reduce thresholds, include more data, optimize settings
- Ongoing: watch savings grow—the system runs automatically, freeing up more storage daily

Ready to reduce your storage usage by half?

Next steps:

- Join our community to connect with other users
- Explore our training on CDP and Apache Ozone

Technical note: Storage Optimizer is fully integrated with Cloudera Data Platform (CDP) Private Cloud 7.3.2.x. For detailed configuration and technical documentation, visit Cloudera's documentation portal.
03-06-2025
09:34 PM
Possible Causes & Solutions:

**1. Token Expired**

The error message shows the token's maxDate=2025-03-13T06:57:54.532Z, which means it is valid until March 13, 2025. However, if the token was already expired or incorrectly renewed, it would cause this failure.

Solution: check the current ticket and token state:

```bash
klist -e
```

If expired, renew the token manually:

```bash
hdfs dfs -renewDelegationToken <token>
```

**2. Incorrect or Missing Kerberos Credentials**

If the application is running on a Kerberized cluster, YARN must have a valid Kerberos ticket.

Solution: ensure the Kerberos ticket is valid (klist). If the ticket is expired, re-authenticate:

```bash
kinit -kt /etc/security/keytabs/yarn.service.keytab yarn/<hostname>@PLATFORMKRB.COM
```

**3. Incorrect Token Renewer Principal**

The error message includes renewer=yarn. This means the token was issued for YARN to renew, but YARN may not have the required permissions.

Solution: ensure the renewer principal is correctly configured in core-site.xml:

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>RULE:[1:$1@$0](yarn@PLATFORMKRB.COM)s/.*/yarn/</value>
</property>
```

Restart YARN after making changes.

**4. Misconfigured Ozone Token in YARN**

If the Ozone token is not being properly renewed, you may need to refresh it.

Solution: try manually obtaining a new Ozone token and pass it explicitly when running the application:

```bash
hdfs fetchdt -fs o3fs://<bucket>.<volume>.<om-host>:9862 <token-file>
```

**5. OM (Ozone Manager) Certificate or Service ID Issues**

The error mentions omServiceId=<hostname> and omCertSerialId=6. If the OM certificate has expired or the service ID is incorrect, it can prevent token renewal.

Solution: check whether the OM certificate is valid:

```bash
ozone admin cert list
```

If expired, renew the certificates and restart OM.

**6. Mismatched Ozone and YARN Versions**

If the Ozone and YARN versions are incompatible, token renewal may fail.

Solution: check the Ozone and YARN versions and ensure they are compatible:

```bash
ozone version
yarn version
```

Next steps:

a. Check if the Kerberos ticket is valid (klist).
b. Try renewing the token manually (hdfs dfs -renewDelegationToken).
c. Ensure the correct renewer principal is configured in YARN.
d. Verify Ozone Manager certificates (ozone admin cert list).
e. Restart YARN and OM after making changes.
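If it helps, the checks above can be strung together into one quick diagnostic pass. This is only a convenience sketch of the same commands already listed; adjust principals, keytab paths, and hosts to your cluster before running it.

```bash
#!/usr/bin/env bash
# Quick diagnostic pass for Ozone delegation-token renewal failures in YARN.
set -x

# 1) Kerberos ticket state (re-run kinit with your service keytab if expired)
klist -e

# 2) Ozone Manager certificate status
ozone admin cert list

# 3) Version check for compatibility between Ozone and YARN
ozone version
yarn version
```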
01-25-2023
01:05 AM
Ok guys, I figured this out myself:

```bash
curl -ikv --negotiate -u: https://dev-ran-7.dev-ran.root.hwx.site:8995/solr/ranger_audits/select -d '
q=*:*&wt=json&fl=access%2C%20agent%2C%20repo%2C%20resource%2C%20resType%2C%20event_count&fq=access%3Aread&fq=repo%3Acm_ozone&fq=-repoType%3A7&fq=resType%3Akey&fq=evtTime%3A%5B2023-01-23T18%3A30%3A00Z+TO+NOW%5D&
json.facet={
  resources:{
    type : terms,
    field : resource,
    facet:{
      read_access_count : "sum(event_count)"
    }
  }
}'
```
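For readers who prefer Solr's JSON Request API over URL-encoded parameters, the same terms facet with a sum aggregation can also be sent as a JSON body. This is a hedged sketch: the host, port, and Kerberos/Knox negotiate flags are carried over from the command above and may differ in your environment.

```bash
curl -ik --negotiate -u: \
  "https://dev-ran-7.dev-ran.root.hwx.site:8995/solr/ranger_audits/select" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "*:*",
    "filter": ["access:read", "repo:cm_ozone", "resType:key",
               "-repoType:7", "evtTime:[2023-01-23T18:30:00Z TO NOW]"],
    "facet": {
      "resources": {
        "type": "terms",
        "field": "resource",
        "facet": { "read_access_count": "sum(event_count)" }
      }
    }
  }'
```

The facet groups documents by resource and sums event_count per group, which is the group-by-and-sum behavior asked about in the original question.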
01-25-2023
12:21 AM
Hello guys,

Can someone please help me write a Solr query where results are grouped on one field (text type) and summed on another field?

Here are my document results:

```json
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 30,
    "params": {
      "q": "*:*",
      "doAs": "knoxui",
      "fl": "access, agent, repo, resource, resType, event_count",
      "fq": [
        "access:read",
        "repo:cm_ozone",
        "resType:key",
        "action:read"
      ],
      "_forwardedCount": "1",
      "_": "1674633786010"
    }
  },
  "response": {
    "numFound": 8,
    "start": 0,
    "docs": [
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/dir2/dir3", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/test3.txt", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/dir2", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/dir2/dir3", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/test3.txt", "resType": "key", "event_count": 1 },
      { "access": "read", "agent": "ozone", "repo": "cm_ozone", "resource": "volume1/fso-bucket/dir1/dir2", "resType": "key", "event_count": 1 }
    ]
  }
}
```

I need something equivalent to this SQL:

```sql
select resource, count(event_count) from <docs> group by resource;
```
Labels:
- Apache Solr