Performance comparison of Spark3 on YARN with S3 Standard vs. S3 Express One Zone on AWS

Introduction

As organizations increasingly migrate their data workloads to the cloud, choosing the right storage solution is crucial. This article compares Spark3 on YARN workloads on two Amazon S3 storage classes, S3 Standard and S3 Express One Zone, running on Cloudera on AWS with Cloudera Runtime 7.2.18. We'll look at the results of a series of benchmark tests, highlighting the performance differences, cost implications, and key considerations for these two storage options.

Background and Motivation

In cloud computing, selecting the right storage solution directly impacts the overall performance, availability, and cost of your data applications. Spark workloads that use Amazon S3 storage often face choices between multiple storage classes, each offering distinct features and trade-offs. This blog provides insight into the performance of Spark3 on YARN workloads when using S3 Standard compared to S3 Express One Zone, leveraging the industry-standard benchmarking methodology (TPC-DS). This comparison will help decision-makers determine which storage class is best suited for their specific workloads.

Test Environment & Cluster Configuration

Datagen: https://github.com/databricks/spark-sql-perf

Database: dex_tpcds_sf1000_withdecimal_withdate_withnulls (Data Size: 1TB, Format: Parquet)

parquet.memory.pool.ratio: 0.1

spark.sql.parquet.compression.codec: snappy

spark.sql.shuffle.partitions: 2000

spark.sql.files.maxRecordsPerFile: 20000000

Table and column statistics computed: yes (using the statements below)

ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS

ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS FOR COLUMNS $allColumns
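The two statements above are templates: `$databaseName` and `$name` are filled in per table, and `$allColumns` is the table's comma-separated column list. A minimal Python sketch of how they might be expanded follows. The helper name `analyze_statements` is hypothetical, and the column list is abbreviated for illustration; the article does not show the actual script used.

```python
def analyze_statements(database, table, columns):
    """Return the two ANALYZE TABLE statements for one table.

    Hypothetical helper for illustration; it only builds SQL strings.
    """
    qualified = f"{database}.{table}"
    column_list = ", ".join(columns)
    return [
        f"ANALYZE TABLE {qualified} COMPUTE STATISTICS",
        f"ANALYZE TABLE {qualified} COMPUTE STATISTICS FOR COLUMNS {column_list}",
    ]


# Database name is taken from the article; the column list is abbreviated.
statements = analyze_statements(
    "dex_tpcds_sf1000_withdecimal_withdate_withnulls",
    "store_sales",
    ["ss_item_sk", "ss_quantity"],
)
for statement in statements:
    print(statement)
```

In a Spark session, each generated string could then be passed to `spark.sql(...)`.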

We ran all of the tests using the TPC-DS benchmark on Spark3 on YARN with the following configurations:

Amazon AWS

  • Cloudera Manager Version: 7.12.0.0
  • Cloudera Runtime Version: 7.2.18.0-452
  • Spark Version: 3.4.1

Spark service-related configuration: 

  • Spark Dynamic allocation enabled: False
  • Driver Memory: 16G (Cores: 2)
  • Executor Memory: 16G (Cores: 2)
  • No. of Executors: 27
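As a rough illustration, the settings listed in this section could be expressed as standard Spark properties and passed to a submit command. This is a sketch under assumptions, not the command used for the benchmark: the launcher name `spark-submit` (some distributions install the Spark 3 launcher under a different name), the application file `tpcds_benchmark.py`, and the helper `spark_submit_command` are all placeholders. `parquet.memory.pool.ratio` is left out because the article does not say how it was passed.

```python
# Settings listed in this article, mapped to standard Spark property names.
SPARK_CONF = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.driver.memory": "16g",
    "spark.driver.cores": "2",
    "spark.executor.memory": "16g",
    "spark.executor.cores": "2",
    "spark.executor.instances": "27",
    "spark.sql.shuffle.partitions": "2000",
    "spark.sql.parquet.compression.codec": "snappy",
    "spark.sql.files.maxRecordsPerFile": "20000000",
}


def spark_submit_command(app, conf):
    """Build a spark-submit argument list for a YARN run."""
    command = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        command += ["--conf", f"{key}={value}"]
    command.append(app)
    return command


# "tpcds_benchmark.py" is a placeholder application name.
command = spark_submit_command("tpcds_benchmark.py", SPARK_CONF)
print(" ".join(command))
```

Keeping the settings in one dictionary makes it easy to run the same configuration against both storage classes, so that only the data location differs between runs.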

YARN CPU and memory resource consumption (total):

  • CPU: 65 vCores
  • Memory (GiB): 540 GB

The storage classes used for the benchmark tests were Amazon S3 Standard, which offers high availability and durability across multiple Availability Zones, and S3 Express One Zone, which provides high-performance storage within a single Availability Zone and therefore reduced redundancy. We executed each workload five times and used the average runtime for comparison.
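The averaging step can be sketched in a few lines of Python. The function name `average_runtime` is hypothetical and the runtimes below are made-up values, not measurements from this benchmark.

```python
from statistics import mean


def average_runtime(runtimes_seconds):
    """Average the runtimes of repeated runs of one workload."""
    if not runtimes_seconds:
        raise ValueError("at least one run is required")
    return mean(runtimes_seconds)


# Made-up runtimes (seconds) for five runs of one query; not measured data.
example_runs = [12.0, 11.5, 12.5, 11.0, 13.0]
print(f"average runtime: {average_runtime(example_runs):.2f} s")
```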

Test Methodology

The benchmark tests used the TPC-DS dataset, the industry standard for measuring the performance of data processing systems. The test aimed to evaluate the total runtime of a variety of SQL-like queries using Spark3 running on YARN while leveraging different Amazon S3 storage classes. We executed all selected queries under the same YARN application context to ensure consistency.

  • Generate Parquet format data of size 1TB in S3 Standard and S3 Express One Zone
  • Create tables on top of the data and compute table/column statistics
  • Execute all TPC-DS queries (read-only): 102 queries tested
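The query-execution step can be pictured as a small timing harness. The sketch below is an assumption about how such a loop might look, not the harness used for these tests: `run_benchmark` is a hypothetical name, and the stand-in executor only records the SQL it receives so the example runs without a cluster.

```python
import time


def run_benchmark(queries, execute, repeats=5):
    """Time each query `repeats` times and return the mean runtime per query.

    `queries` maps a query name to its SQL text; `execute` is any callable
    that runs one SQL string to completion.
    """
    averages = {}
    for name, sql in queries.items():
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter()
            execute(sql)
            elapsed.append(time.perf_counter() - start)
        averages[name] = sum(elapsed) / len(elapsed)
    return averages


# Stand-in executor so the sketch runs without a cluster.
executed = []
results = run_benchmark(
    {"q1": "SELECT 1", "q2": "SELECT 2"},
    execute=executed.append,
)
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

With PySpark, `execute` could be something like `lambda sql: spark.sql(sql).collect()`, which forces each query to run to completion inside the same Spark application.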

Results and Analysis

image1.png

image2.png

Key Observations:

  • The S3 Express One Zone storage class generally delivered lower execution times for most queries, which is consistent with its single-Availability-Zone design that keeps storage close to compute.
  • S3 Standard showed longer runtimes than S3 Express One Zone, which is expected given its higher redundancy and durability.
  • The following queries show notable differences in runtime:

Image3.png

Cost Efficiency and Performance Correlation

S3 Express One Zone has up to 50% lower request costs than S3 Standard and, in our tests, showed a 38% reduction in average runtime. It provides high-performance, single-Availability-Zone storage that delivers consistent single-digit millisecond access. Co-locating storage and compute in the same Availability Zone reduces latency, leading to faster workloads and lower compute resource usage. S3 Express One Zone is a strong choice for non-critical workloads where cost efficiency, low latency, and high performance are key priorities.
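For reference, a runtime reduction of this kind is computed as the difference between the two average runtimes divided by the baseline. The sketch below uses a hypothetical function name and illustrative inputs, not the measured averages, which appear only in the charts.

```python
def runtime_reduction_percent(baseline_seconds, candidate_seconds):
    """Percentage by which the candidate runtime is lower than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100


# Illustrative numbers only, chosen so the result matches the reported 38%.
standard_avg = 100.0
express_avg = 62.0
print(f"{runtime_reduction_percent(standard_avg, express_avg):.0f}% reduction")
```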

The combination of lower request costs and faster execution times makes S3 Express One Zone a compelling option for use cases where data availability is not the highest priority. However, for mission-critical workloads that require high availability and resilience, S3 Standard may still be preferable for its durability and redundancy, despite the higher request costs and longer runtimes.

image4.png

Things to Consider When Choosing the Right S3 Storage Class

  1. Data Availability Requirements: If your workloads need high availability and redundancy, S3 Standard is a reliable choice, ensuring your data remains accessible even in the event of hardware failure.
  2. Cost Considerations: For test data, non-critical applications, or data that can tolerate lower availability, S3 Express One Zone provides a more cost-effective option.
  3. Performance Sensitivity: Performance-sensitive Spark workloads may benefit from S3 Express One Zone, which demonstrated faster runtimes in our tests.
  4. Workload Nature: For batch processing jobs or non-production use cases where cost efficiency is more important than availability, S3 Express One Zone is the ideal choice.

When to use S3 Express One Zone

S3 Express One Zone is ideal for:

  • Workloads where cost efficiency is a higher priority than high availability, such as test and development environments.
  • Applications that require low latency and faster data access, such as video streaming or financial simulations. By co-locating storage and compute in the same Availability Zone, S3 Express One Zone reduces latency and speeds up data processing.
  • Scenarios where durability and high availability are not critical. This includes batch jobs or workloads that can tolerate occasional data unavailability, focusing instead on cost efficiency and performance.

S3 Express One Zone is supported only for Data Hub (compute) clusters and is not currently available for Cloudera Data Services. Additionally, replacing the default Amazon S3 storage for the Data Lake with S3 Express One Zone is not supported.

When to use S3 Standard 

S3 Standard is ideal for:

  • Mission-critical workloads that require high durability and availability across multiple availability zones. 
  • Customer-facing applications or production workloads where data availability is crucial to maintaining Service-Level Agreements (SLAs).
  • Use cases involving regulatory or compliance requirements that dictate data redundancy across multiple locations. 
  • Data lake storage for long-term, high-availability storage of critical business data that must be accessed across multiple applications.
  • Backup and disaster recovery solutions where durability and access in the event of failure are critical to ensure business continuity.

Conclusion

In this analysis, the S3 Express One Zone storage class generally provided better performance for Spark3 on YARN workloads, making it suitable for data processing workloads that can tolerate reduced redundancy but require fast execution. S3 Standard, on the other hand, offers higher durability and availability, which may be necessary for mission-critical workloads. The right choice ultimately depends on the specific workload requirements and the balance between cost, availability, and performance. We hope this comparison helps you make informed decisions about the best Amazon S3 storage option for your Spark workloads in the cloud.

Visit the product page to learn more about usage and steps to update the configuration or reach out to your account team. Additionally, start your free 5-day trial of Cloudera's public cloud services to experience the platform firsthand.