Created on 01-31-2025 09:51 PM - last edited on 02-12-2025 11:29 PM by VidyaSargur
As organizations migrate more of their data workloads to the cloud, choosing the right storage solution becomes crucial. This article compares Spark3 on YARN workloads on two Amazon S3 storage classes, S3 Standard and S3 Express One Zone, running on Cloudera on AWS with CDP runtime version 7.2.18. We look at the results of a series of benchmark tests that highlight the performance differences, cost implications, and key considerations for these two storage options.
In cloud computing, the storage solution directly affects the overall performance, availability, and cost of your data applications. Spark workloads that use Amazon S3 often have to choose between multiple storage classes, each with distinct features and trade-offs. This blog provides insight into the performance of Spark3 on YARN workloads on S3 Standard compared to S3 Express One Zone, using the industry-standard TPC-DS benchmarking methodology. The comparison is intended to help decision-makers determine which storage class best suits their specific workloads.
Datagen: https://github.com/databricks/spark-sql-perf
Database: dex_tpcds_sf1000_withdecimal_withdate_withnulls (Data Size: 1TB, Format: Parquet)
parquet.memory.pool.ratio: 0.1
spark.sql.parquet.compression.codec: snappy
spark.sql.shuffle.partitions: 2000
spark.sql.files.maxRecordsPerFile: 20000000
Table and column statistics computed: yes (using the statements below)
ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS
ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS FOR COLUMNS $allColumns
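As a minimal sketch of how the two statements above could be generated per table, the following Python helper expands the `$databaseName`, `$name`, and `$allColumns` placeholders. The function name `analyze_statements` and the sample table with its shortened column list are illustrative assumptions, not part of the original benchmark scripts.

```python
from typing import Dict, List


def analyze_statements(database_name: str, table_columns: Dict[str, List[str]]) -> List[str]:
    """Build the table-level and column-level ANALYZE statements for each table."""
    statements = []
    for name, columns in table_columns.items():
        # Table-level statistics.
        statements.append(f"ANALYZE TABLE {database_name}.{name} COMPUTE STATISTICS")
        # Column-level statistics for every listed column.
        all_columns = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {database_name}.{name} COMPUTE STATISTICS FOR COLUMNS {all_columns}"
        )
    return statements


# Hypothetical input: one TPC-DS table with a shortened column list.
tables = {"store_sales": ["ss_sold_date_sk", "ss_item_sk"]}
for statement in analyze_statements("dex_tpcds_sf1000_withdecimal_withdate_withnulls", tables):
    print(statement)
```

In a Spark session, each generated string could then be passed to `spark.sql(...)`.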
We ran all of the tests using the TPC-DS benchmark on Spark3 on YARN with the following configurations:
Amazon AWS
Spark service-related configuration:
YARN CPU and memory resource consumption (total)
The storage classes used for the benchmark tests were Amazon S3 Standard, which offers high availability and durability across multiple Availability Zones, and S3 Express One Zone, which stores data in a single Availability Zone and trades redundancy for lower latency and lower request costs. We executed each workload five times and used the average runtime for comparison.
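The averaging step can be sketched in a few lines of Python. The helper names `average_runtime` and `percent_reduction` are assumptions made for illustration, and the runtimes in the example are placeholder values, not measurements from this benchmark.

```python
from statistics import mean
from typing import Sequence


def average_runtime(runs: Sequence[float]) -> float:
    """Average the runtimes (in seconds) of repeated runs of one workload."""
    if not runs:
        raise ValueError("at least one run is required")
    return mean(runs)


def percent_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline * 100.0


# Placeholder runtimes for five runs per storage class (NOT benchmark results).
standard_runs = [200.0, 200.0, 200.0, 200.0, 200.0]
express_runs = [150.0, 150.0, 150.0, 150.0, 150.0]

reduction = percent_reduction(average_runtime(standard_runs), average_runtime(express_runs))
print(f"Average runtime reduction: {reduction:.1f}%")
```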
The benchmark tests used the TPC-DS dataset, the industry standard for measuring the performance of data processing systems. The test aimed to evaluate the total runtime of a variety of SQL-like queries using Spark3 running on YARN while leveraging different Amazon S3 storage classes. We executed all selected queries under the same YARN application context to ensure consistency.
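One way to time a set of queries inside a single application is sketched below. This is an assumption about how such a harness could look, not the harness used for these tests: `time_queries` and its `run_query` callback are hypothetical names, and in PySpark the callback might be something like `lambda sql: spark.sql(sql).collect()` on one shared session.

```python
import time
from typing import Callable, Dict


def time_queries(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run each query once through the same callback and record its wall-clock time."""
    timings = {}
    for name, sql in queries.items():
        start = clock()
        run_query(sql)  # executed in the caller's single session/application
        timings[name] = clock() - start
    return timings


# Stand-in runner so the sketch is runnable without a cluster (hypothetical).
def fake_runner(sql: str) -> None:
    pass


timings = time_queries(fake_runner, {"q1": "SELECT 1", "q2": "SELECT 2"})
total_runtime = sum(timings.values())
print(f"Total runtime: {total_runtime:.6f} s")
```

Passing the same callback for every query keeps all executions in one application context, so per-application startup overhead does not skew individual query times.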
Cost Efficiency and Performance Correlation
S3 Express One Zone is up to 50% cheaper in request costs and demonstrated a 38% reduction in average runtime compared to S3 Standard. It provides high-performance, single-Availability Zone storage that delivers consistent single-digit millisecond access. Co-locating storage and compute in the same Availability Zone reduces latency, which leads to faster workloads and lower compute resource usage. S3 Express One Zone is a strong choice for non-critical workloads where cost efficiency, low latency, and high performance are key priorities.
The combination of lower request costs and faster execution times makes S3 Express One Zone a compelling option for use cases where data availability is not the highest priority. However, for mission-critical workloads that require high availability and resilience, S3 Standard may still be preferable for its durability and redundancy, despite higher request costs and longer runtimes.
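To reason about this trade-off for a specific workload, a rough cost model can help. The sketch below is a simplification under stated assumptions: the `workload_cost` function, its parameters, and every number in the example are hypothetical placeholders rather than AWS or Cloudera prices, so substitute current pricing and your own measurements.

```python
def workload_cost(
    compute_hours: float,
    hourly_compute_rate: float,
    requests: int,
    price_per_1000_requests: float,
    storage_gb: float,
    price_per_gb_month: float,
) -> float:
    """Rough monthly cost: compute time plus S3 request charges plus S3 storage."""
    compute = compute_hours * hourly_compute_rate
    request_charges = requests / 1000 * price_per_1000_requests
    storage = storage_gb * price_per_gb_month
    return compute + request_charges + storage


# Placeholder inputs only (NOT real prices or benchmark results).
option_a = workload_cost(4.0, 10.0, 2_000_000, 0.5, 1000.0, 0.25)
option_b = workload_cost(2.5, 10.0, 2_000_000, 0.25, 1000.0, 0.5)
print(f"Option A: {option_a:.2f}, Option B: {option_b:.2f}")
```

Keeping storage as its own term matters because per-GB storage pricing and per-request pricing can move in opposite directions between storage classes, so a shorter runtime alone does not settle the comparison.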
S3 Express One Zone is ideal for:
S3 Express One Zone is supported only for Data Hub (compute) clusters and is not currently available for Cloudera Data Services. Additionally, replacing the default Amazon S3 storage for the Data Lake with S3 Express One Zone is not supported.
S3 Standard is ideal for:
In this analysis, the S3 Express One Zone storage class consistently provided better performance for Spark3 on YARN workloads, which makes it suitable for data processing workloads that can tolerate reduced redundancy but require fast execution. S3 Standard, on the other hand, offers greater durability and availability, which may be necessary for mission-critical workloads. The right choice ultimately depends on the specific workload requirements and the balance between cost, availability, and performance. We hope this comparison helps you make informed decisions about the best Amazon S3 storage option for your Spark workloads in the cloud.
Visit the product page to learn more about usage and steps to update the configuration or reach out to your account team. Additionally, start your free 5-day trial of Cloudera's public cloud services to experience the platform firsthand.