
Introduction

Cloudera Operational Database (COD) is a service that runs on the Cloudera Data Platform (CDP). COD enables you to create a new operational database that automatically scales based on your workload. Deploying high-performance applications at scale requires a robust operational database, and COD is designed as a high-performance, highly scalable operational database for data applications of any size. Powered by Apache HBase and Apache Phoenix, COD ships out of the box with CDP in the public cloud. It is also ready for hybrid and multi-cloud deployments, whether on AWS, Microsoft Azure, or GCP.

Support for cloud storage is an important capability of COD. In addition to the pre-existing support for HDFS on local storage, it offers customers a choice of price-performance characteristics. For the performance differences between COD on HDFS and COD on cloud storage with ephemeral cache (Amazon AWS and Microsoft Azure), refer to the blog listed under References.

To understand how COD delivers cost-efficient performance for your applications, let’s look at benchmarking results that compare COD on different cloud storage services.

Methodology

The tests were performed on a data set created using the Yahoo! Cloud Serving Benchmark (YCSB) test framework on AWS. YCSB is an open-source benchmarking suite for performance evaluations. It is frequently used to measure the performance of multi-node database systems on the public cloud and other distributed infrastructure. 

In this performance evaluation, a large dataset of 20TB was generated and backed up to an S3 bucket for further use. The same data was in turn exported to run the performance tests on Azure and GCP for fair comparison.

This article measures the performance differences between COD on Amazon AWS, Microsoft Azure, and Google GCP, each using cloud storage with ephemeral cache. It does not evaluate the performance of cloud storage, local disks, and block storage independently.

Dataset

The details of the dataset used for these performance tests are as follows:

  • Data size: 20TB
  • Number of rows in the table: 20 billion
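As a quick sanity check on these figures, the average row size follows from dividing the data size by the row count. The sketch below assumes decimal units (1 TB = 10^12 bytes) and ignores HBase storage overhead, compression, and replication, none of which the article specifies.

```python
# Back-of-the-envelope row size from the dataset figures quoted above.
# Assumption: decimal units (1 TB = 10**12 bytes); storage overhead is ignored.
DATA_SIZE_BYTES = 20 * 10**12   # 20 TB
ROW_COUNT = 20 * 10**9          # 20 billion rows


def average_row_size(data_size_bytes: int, row_count: int) -> float:
    """Return the mean number of bytes stored per row."""
    return data_size_bytes / row_count


print(average_row_size(DATA_SIZE_BYTES, ROW_COUNT))  # 1000.0
```

Roughly 1 KB per row is consistent with the YCSB default record shape of 10 fields of 100 bytes each, although the article does not state which field settings were used.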

Environment

  • AWS
    • No. of master nodes: 2 (m5.2xlarge)
    • No. of leader nodes: 1 (m5.2xlarge)
    • No. of gateway nodes: 1 (m5.2xlarge)
    • No. of worker nodes: 20 (i3.2xlarge) (Storage as S3)
  • Azure
    • No. of master nodes: 2 (Standard_D8a_V4)
    • No. of leader nodes: 1 (Standard_D8a_V4)
    • No. of gateway nodes: 1 (Standard_D8a_V4)
    • No. of worker nodes: 20 (Standard_L8s_V2) (Storage as ABFS)
  • GCP
    • No. of master nodes: 2 (n2-standard-8)
    • No. of leader nodes: 1 (n2-standard-8)
    • No. of gateway nodes: 1 (e2-standard-8)
    • No. of worker nodes: 20 (n2-standard-16) (Storage as GCS)
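To put the cluster sizes in context, the share of the dataset handled by each worker can be estimated from the figures above. This is only a rough sketch: it assumes the data is spread evenly across the 20 workers, which real HBase region placement does not guarantee.

```python
# Worker counts come from the Environment list; dataset figures from the Dataset list.
# Assumption: an even spread of data across workers (actual region placement varies).
WORKERS = {"AWS": 20, "Azure": 20, "GCP": 20}
DATA_SIZE_TB = 20
ROW_COUNT = 20 * 10**9


def per_worker_share(total: float, workers: int) -> float:
    """Return the portion of `total` that each worker would hold on average."""
    return total / workers


for cloud, worker_count in WORKERS.items():
    tb = per_worker_share(DATA_SIZE_TB, worker_count)
    rows = per_worker_share(ROW_COUNT, worker_count)
    print(f"{cloud}: {tb:.1f} TB and {rows:,.0f} rows per worker")
```

Because all three clusters use 20 workers, each worker is responsible for about 1 TB and 1 billion rows under this assumption.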

YCSB details

The tests were run using the YCSB tool. The details are given below:

  • Performance benchmarking was done using the following YCSB workloads
    • YCSB Workload A
      • Update heavy workload
      • 50% read, 50% write
    • YCSB Workload C
      • 100% read
    • YCSB Workload F
      • Read-Modify-Update workload
      • 50% read, 25% update, 25% read-modify-update
  • The following parameters were used to run the workloads using YCSB:
    • Each workload was run for 15 min (900 secs) in the following order:
      • Workload C (a warm-up run that populates the cache for the subsequent workload runs)
      • Workload A
      • Workload C
      • Workload F
    • Sample set for running the workloads:
      • 1 billion rows
      • 100 million batch
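The workload mix and run order above can be expressed as YCSB CoreWorkload properties. The following is a minimal sketch, not the configuration actually used in these tests: the proportions, duration, and run order come from the list above, while mapping the 1 billion row sample set to `recordcount` is an assumption, and the helper names `build_properties` and `render_properties` are hypothetical.

```python
# Sketch of YCSB CoreWorkload properties for the workloads described above.
# Assumption: the "1 billion rows" sample set maps to the recordcount property.
RUN_SECONDS = 900             # 15 minutes per workload
SAMPLE_ROWS = 1_000_000_000   # sample set used for the runs

# Operation mix per workload, as stated in the article.
WORKLOADS = {
    "A": {"readproportion": 0.50, "updateproportion": 0.50},
    "C": {"readproportion": 1.00},
    "F": {"readproportion": 0.50, "updateproportion": 0.25,
          "readmodifywriteproportion": 0.25},
}

# The first Workload C run only warms up the cache.
RUN_ORDER = ["C", "A", "C", "F"]

PROPORTION_KEYS = ["readproportion", "updateproportion", "insertproportion",
                   "scanproportion", "readmodifywriteproportion"]


def build_properties(name: str) -> dict:
    """Return a full property set for one workload, zeroing unused operations."""
    mix = WORKLOADS[name]
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError(f"proportions for workload {name} must sum to 1")
    props = {key: mix.get(key, 0.0) for key in PROPORTION_KEYS}
    props["recordcount"] = SAMPLE_ROWS
    props["maxexecutiontime"] = RUN_SECONDS
    return props


def render_properties(props: dict) -> str:
    """Format properties as key=value lines, as used in a YCSB properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


for step, name in enumerate(RUN_ORDER, start=1):
    print(f"# run {step}: workload {name}")
    print(render_properties(build_properties(name)))
```

Setting every proportion explicitly matters because CoreWorkload falls back to its own defaults for any proportion that is left unset.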

Results

The results below compare AWS, Azure, and GCP after a 100% ephemeral cache warm-up, which ensures that most of the blocks are in the cache.

The charts below show the time taken to warm up the cache for COD on Amazon AWS and COD on GCP. COD on AWS takes twice as long to warm up the cache as COD on GCP.

[Figure: Cache warmup on COD on AWS]

[Figure: Cache warmup on COD on GCP]

The following chart shows the comparison between some key performance indicators on AWS, Azure, and GCP cloud platforms:

[Figure: Performance comparison of COD with ephemeral cache on Amazon AWS vs. Microsoft Azure vs. GCP]

The following chart shows the average throughput observed while running the YCSB tests. The average throughput of HBase on Google GCS is higher than that of HBase on Amazon AWS and Microsoft Azure across the different workload types. HBase with Google GCS therefore gives better overall performance than the other cloud providers.
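YCSB reports throughput and latency as comma-separated summary lines at the end of each run, so figures like these can be collected with a small parser. The sketch below is illustrative only: the function name `parse_ycsb_summary` is hypothetical, and the sample values are placeholders chosen for the example, not measurements from these tests.

```python
# Minimal parser for YCSB summary lines of the form "[SECTION], Metric, value".
# The sample text uses placeholder numbers, not results from this benchmark.
SAMPLE_OUTPUT = """\
[OVERALL], RunTime(ms), 900000
[OVERALL], Throughput(ops/sec), 1000.0
[READ], AverageLatency(us), 500.0
[UPDATE], AverageLatency(us), 800.0
"""


def parse_ycsb_summary(text: str) -> dict:
    """Map (section, metric) pairs to numeric values, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # not a summary line
        try:
            value = float(parts[2])
        except ValueError:
            continue  # non-numeric value, ignore
        metrics[(parts[0].strip("[]"), parts[1])] = value
    return metrics


summary = parse_ycsb_summary(SAMPLE_OUTPUT)
print(summary[("OVERALL", "Throughput(ops/sec)")])
print(summary[("READ", "AverageLatency(us)")])
```

Running the same parser over the output of each workload on each cloud would give the per-platform throughput and latency values needed for a side-by-side comparison.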

[Figure: Average throughput comparison between Amazon S3 vs. Microsoft ABFS vs. Google GCS]

The following chart shows the latency observed while running the workloads involving reads.

The results show that HBase with Google GCS has lower read latencies than HBase with Amazon AWS and Microsoft Azure for the read-only workload (Workload C), while the three are comparable for a mixed workload such as Workload A.

[Figure: Read latency comparison between Amazon S3 vs. Microsoft ABFS vs. Google GCS]

The following chart shows the latency observed while running workloads involving writes.

The results show that the write latency of HBase with Google GCS is lower than that of HBase with Amazon AWS and Microsoft Azure by a large margin.

[Figure: Write latency comparison between Amazon S3 vs. Microsoft ABFS vs. Google GCS]

Summary

The above comparison shows that GCP with GCS performs better than Amazon AWS and Microsoft Azure, with higher overall throughput and lower read and write latencies while running the workloads. The write latencies for GCP with GCS were substantially better than those of the other two platforms, which is attributed to the performance of the block storage in GCS.

References

A similar performance experiment was performed to compare the performance of COD running on HDFS vs. COD running on cloud storage provided by Amazon AWS and Microsoft Azure. The details of these experiments can be found in the blog titled Cloudera Operational Database (COD) Performance Benchmarking: Comparing HDFS and Cloud Storage.

A detailed description of how to run YCSB for HBase can be found in the blog titled How to run YCSB for HBase.

Visit the product page to learn more about the Cloudera Operational Database or reach out to your account team. 
