Introduction

Cloudera Operational Database (COD) is a service within the Cloudera Data Platform (CDP) that lets users create operational databases that scale automatically with workload demand. High-performance applications deployed at scale depend on a robust operational database, and COD is engineered to provide one: a highly scalable, high-performance database for data-intensive applications.

Built on Apache HBase and Apache Phoenix, COD is integrated into the Cloudera Data Platform (CDP) in the public cloud. It supports hybrid and multi-cloud deployments across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

For AWS deployments, COD provides two primary storage options: S3 with Ephemeral Cache, which offers high performance at a slightly higher cost, and S3 without Ephemeral Cache, which is more budget-friendly but slower. Recently, AWS introduced S3 Express One Zone (referred to in this article as "Express S3"), a storage class that AWS claims delivers up to 10x faster data access than standard S3. However, it stores data within a single Availability Zone, resulting in diminished durability compared to regular S3. The speed of Express S3 intrigued us, leading to an exploration of its potential as a primary storage solution that achieves high performance without relying on Ephemeral Cache.

Consequently, we embarked on evaluating this new storage type, particularly for users who are comfortable with the existing durability parameters of Express S3.

In the following sections, we delve into the benchmarking results, comparing the performance of all three storage types. We provide conclusions that can guide decision-making processes for users leveraging the Cloudera Operational Database on AWS.

Methodology

We used the Yahoo! Cloud Serving Benchmark (YCSB) framework for our performance testing. YCSB is an open-source benchmarking suite that is widely used to measure the throughput and latency of database systems on multi-node clusters, including those deployed in public cloud environments.

Dataset

For this performance assessment, a 20TB dataset was generated and stored in an S3 bucket. The same dataset was used across all tests so that the results are directly comparable.

Key Dataset Details:

  • Data size: 20TB
  • Number of rows in the table: 20 billion
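
As a quick sanity check on these figures, the average row size implied by the dataset can be worked out as follows. This is a sketch that assumes decimal units (1 TB = 10^12 bytes); the article does not state whether the 20TB figure is raw or on-disk size. The result of about 1 KB per row happens to match YCSB's default record layout of 10 fields of 100 bytes each, although the field settings used in these tests are not stated.

```python
# Rough average row size implied by the dataset figures above.
# Assumption: decimal terabytes (1 TB = 10**12 bytes); the article does not
# say whether 20TB is raw data or on-disk size including HBase overhead.
DATA_SIZE_BYTES = 20 * 10**12
ROW_COUNT = 20 * 10**9


def average_row_bytes(total_bytes: int, rows: int) -> float:
    """Return the mean number of bytes stored per row."""
    return total_bytes / rows


if __name__ == "__main__":
    print(average_row_bytes(DATA_SIZE_BYTES, ROW_COUNT))  # 1000.0
```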

Environment

The benchmarking environment is configured within the AWS infrastructure, with the following specifications:

  • Number of Master Nodes: 2 (m5.2xlarge)
  • Number of Leader Nodes: 1 (m5.2xlarge)
  • Number of Gateway Nodes: 1 (m5.2xlarge)
  • Number of Worker Nodes: 20 (i3.2xlarge)

YCSB details

The benchmarking tests were conducted using the YCSB tool, with a focus on specific workloads tailored to assess the performance characteristics of the Cloudera Operational Database.

YCSB Workloads Employed:

  • Workload A:
    • Update Heavy Workload
    • 50% Read, 50% Update
  • Workload C:
    • 100% Read
  • Workload F:
    • Read-Modify-Update Workload
    • 50% Read, 25% Update, 25% Read-Modify-Update
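
The workload mixes above map onto YCSB's CoreWorkload proportion properties. The sketch below renders them as property-file text. The proportions are copied from the list above; note that the stock `workloadf` file shipped with YCSB uses 50% read and 50% read-modify-write, so the Workload F mix here would need a custom property file. The workload class name is an assumption that depends on the YCSB release in use.

```python
# Workload mixes from the list above, keyed by YCSB CoreWorkload property name.
# Workload A's non-read half is mapped to updateproportion, as in stock YCSB.
WORKLOAD_MIXES = {
    "A": {"readproportion": 0.5, "updateproportion": 0.5},
    "C": {"readproportion": 1.0},
    "F": {
        "readproportion": 0.5,
        "updateproportion": 0.25,
        "readmodifywriteproportion": 0.25,
    },
}

# Assumption: class name used by recent YCSB releases; older releases use
# com.yahoo.ycsb.workloads.CoreWorkload instead.
WORKLOAD_CLASS = "site.ycsb.workloads.CoreWorkload"

PROPORTION_KEYS = (
    "readproportion",
    "updateproportion",
    "insertproportion",
    "scanproportion",
    "readmodifywriteproportion",
)


def render_properties(mix: dict) -> str:
    """Render one workload mix as YCSB property-file text."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("operation proportions must sum to 1.0")
    lines = [f"workload={WORKLOAD_CLASS}"]
    # Emit every proportion explicitly so unused operations are set to 0.
    for key in PROPORTION_KEYS:
        lines.append(f"{key}={mix.get(key, 0)}")
    return "\n".join(lines)


if __name__ == "__main__":
    for name, mix in WORKLOAD_MIXES.items():
        print(f"# Workload {name}")
        print(render_properties(mix))
```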

Testing Parameters

  • Each workload was run for 15 minutes (900 seconds) in the following sequence:
    • Warm-up Run: Workload C - Intended to prime the cache for subsequent workload executions.
    • Workload A
    • Workload C
    • Workload F
  • Sample Set for Running the Workloads:
    • 1 Billion Rows
    • 100 Million Batch Size

Additional Information: In the case of S3 with Ephemeral Cache, the cache was 100% warmed up before running the tests.
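
The run sequence above can be sketched as a list of YCSB command lines, each capped at 900 seconds with YCSB's `maxexecutiontime` property. This is an illustration only, not the exact commands used in these tests: the binding name (`hbase20`), table name, column family, thread count, and workload file paths are all placeholder assumptions, and the Workload F mix described in this article would need its own property file instead of the stock `workloadf`.

```python
import shlex

RUN_SECONDS = 900  # 15 minutes per workload, as stated above

# (label, workload file) in the order given above; the first entry is the
# cache warm-up run. File paths are stock YCSB names used as placeholders.
RUN_SEQUENCE = [
    ("warmup", "workloads/workloadc"),
    ("A", "workloads/workloada"),
    ("C", "workloads/workloadc"),
    ("F", "workloads/workloadf"),
]


def build_command(
    workload_file: str,
    binding: str = "hbase20",    # assumption: depends on YCSB and HBase versions
    table: str = "usertable",    # assumption: YCSB's default table name
    column_family: str = "cf",   # hypothetical column family name
    threads: int = 64,           # hypothetical client thread count
) -> str:
    """Build one `ycsb run` command line limited to RUN_SECONDS."""
    args = [
        "bin/ycsb", "run", binding,
        "-P", workload_file,
        "-p", f"table={table}",
        "-p", f"columnfamily={column_family}",
        "-p", f"maxexecutiontime={RUN_SECONDS}",
        "-threads", str(threads),
        "-s",  # print periodic status while the run is in progress
    ]
    return shlex.join(args)


if __name__ == "__main__":
    for label, workload_file in RUN_SEQUENCE:
        print(f"# {label}")
        print(build_command(workload_file))
```

Running the warm-up as a separate entry keeps its output apart from the measured Workload C run that follows later in the sequence.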

Results

The table below presents all the collected performance indicators across different storage types:

[Image: table of collected performance indicators by storage type]

The charts below provide comparisons of key performance indicators.

Average Throughput

[Chart: Average Throughput]

The above chart illustrates the average throughput observed during the YCSB tests. Notably, S3 with Ephemeral Cache demonstrates a throughput approximately 15-20 times higher than S3 without Ephemeral Cache. Although Express S3, which operates without a cache, displays promising performance compared to standard S3, it falls short of the performance levels achieved by S3 with Ephemeral Cache.

Read Latency

[Chart: Read Latency]

The chart above depicts the latency observed during read-based workloads. S3 with Ephemeral Cache exhibits significantly lower read latency when compared to other storage types. Express S3 also demonstrates improved latency performance compared to standard S3.
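
Throughput and read-latency comparisons like the ones above are typically derived from the summary lines YCSB prints at the end of each run, such as `[OVERALL], Throughput(ops/sec), ...` and `[READ], AverageLatency(us), ...`. The sketch below parses those lines and computes a throughput ratio between two runs. The sample values are made-up placeholders for illustration and are not the results of this benchmark.

```python
def parse_ycsb_summary(text: str) -> dict:
    """Parse '[SECTION], Metric, value' lines into {(section, metric): float}."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip status lines and anything not in summary format
        section, name, value = parts
        try:
            metrics[(section.strip("[]"), name)] = float(value)
        except ValueError:
            continue  # skip lines whose value is not numeric
    return metrics


def throughput_ratio(run_a: dict, run_b: dict) -> float:
    """Return how many times higher run_a's overall throughput is than run_b's."""
    key = ("OVERALL", "Throughput(ops/sec)")
    return run_a[key] / run_b[key]


# Placeholder output with made-up numbers; NOT the article's measurements.
SAMPLE_RUN_A = """[OVERALL], RunTime(ms), 900000
[OVERALL], Throughput(ops/sec), 300.0
[READ], AverageLatency(us), 1500.0"""

SAMPLE_RUN_B = """[OVERALL], RunTime(ms), 900000
[OVERALL], Throughput(ops/sec), 100.0
[READ], AverageLatency(us), 6000.0"""

if __name__ == "__main__":
    run_a = parse_ycsb_summary(SAMPLE_RUN_A)
    run_b = parse_ycsb_summary(SAMPLE_RUN_B)
    print(throughput_ratio(run_a, run_b))  # 3.0
    print(run_a[("READ", "AverageLatency(us)")])  # 1500.0
```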

Summary

Based on these results, S3 with Ephemeral Cache is the best-performing storage option for the Cloudera Operational Database. Express S3 improves on standard S3 but does not match the performance of S3 with Ephemeral Cache. Moreover, because Express S3 is confined to a single zone, it may not be the most suitable choice for users who need optimal performance and durability at the same time.

References

For further insights into performance evaluations of the Cloudera Operational Database (COD), the following resources may be of interest:

  1. "Cloudera Operational Database (COD) Performance Benchmarking: Comparing HDFS and Cloud Storage": a blog post comparing COD performance on HDFS with its performance on cloud storage offered by AWS and Microsoft Azure.
  2. "Performance Comparison of Cloudera Operational Database (COD) on AWS, Azure, and GCP with Ephemeral Cache Enabled": a document comparing COD performance across AWS, Azure, and GCP with ephemeral cache enabled.
  3. "How to run YCSB for HBase": a blog post with detailed instructions for running the Yahoo! Cloud Serving Benchmark (YCSB) against HBase.

For additional information on Cloudera Operational Database, including product features and capabilities, visit the product page or reach out to your account team for personalized assistance.
