01-31-2025 09:51 PM
Introduction
As organizations increasingly migrate their data workloads to the cloud, choosing the right storage solution is crucial. This article compares Spark3 on YARN workloads across two Amazon S3 storage classes, S3 Standard and S3 Express One Zone, running Cloudera on AWS with CDP Runtime version 7.2.18. We'll look at the results of a series of benchmark tests that highlight the performance differences, cost implications, and key considerations for choosing between these two storage options.
Background and Motivation
In cloud computing, selecting the right storage solution directly impacts the overall performance, availability, and cost of your data applications. Spark workloads that use Amazon S3 storage often face a choice between multiple storage classes, each offering distinct features and trade-offs. This blog provides insight into the performance of Spark3 on YARN workloads when using S3 Standard compared to S3 Express One Zone, leveraging the industry-standard TPC-DS benchmarking methodology. This comparison will help decision-makers determine which storage class is best suited for their specific workloads.
Test Environment & Cluster Configuration
Datagen: https://github.com/databricks/spark-sql-perf
Database: dex_tpcds_sf1000_withdecimal_withdate_withnulls (Data Size: 1TB, Format: Parquet)
parquet.memory.pool.ratio: 0.1
spark.sql.parquet.compression.codec: snappy
spark.sql.shuffle.partitions: 2000
spark.sql.files.maxRecordsPerFile: 20000000
Table and column statistics computed: yes (using the statements below)
ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS
ANALYZE TABLE $databaseName.$name COMPUTE STATISTICS FOR COLUMNS $allColumns
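For reference, here is a minimal sketch of how the session-level settings and statistics statements above could be applied through the spark-sql CLI. The table and column names below are illustrative placeholders only; the actual runs computed statistics for all columns of every TPC-DS table, and the parquet.memory.pool.ratio property is assumed to be passed through to the Parquet writer via Spark's spark.hadoop.* prefix.

# Illustrative only: apply the session settings above and compute statistics.
# Replace the table and column names with your own.
DB=dex_tpcds_sf1000_withdecimal_withdate_withnulls

spark-sql \
  --conf spark.hadoop.parquet.memory.pool.ratio=0.1 \
  --conf spark.sql.parquet.compression.codec=snappy \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.sql.files.maxRecordsPerFile=20000000 \
  -e "ANALYZE TABLE ${DB}.store_sales COMPUTE STATISTICS;
      ANALYZE TABLE ${DB}.store_sales COMPUTE STATISTICS FOR COLUMNS ss_item_sk, ss_sold_date_sk;"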
We ran all of the tests using the TPC-DS benchmark kit (spark-sql-perf) on Spark3 on YARN with the following configurations:
Amazon AWS
Cloudera Manager Version: 7.12.0.0
Cloudera Runtime Version: 7.2.18.0-452
Spark Version: 3.4.1
Spark service-related configuration:
Spark Dynamic allocation enabled: False
Driver Memory: 16G (Cores: 2)
Executor Memory: 16G (Cores: 2)
No. of Executors: 27
YARN CPU & memory resource consumption (total):
CPU: 65 vCores
Memory: 540 GiB
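As a rough illustration, the resource settings above map onto a spark-submit invocation like the following. The main class and application JAR are placeholders, not the actual benchmark harness.

# Illustrative only: the resource shape used for the benchmark runs.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --driver-memory 16g \
  --driver-cores 2 \
  --executor-memory 16g \
  --executor-cores 2 \
  --num-executors 27 \
  --class com.example.TpcdsRunner \
  tpcds-benchmark.jar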
The storage classes used for the benchmark tests were Amazon S3 Standard, which offers high availability and durability, and S3 Express One Zone, a high-performance, single-Availability Zone storage class that offers lower request costs with reduced redundancy. We executed each workload five times and used the average runtime for comparison.
Test Methodology
The benchmark tests used the TPC-DS dataset, the industry standard for measuring the performance of data processing systems. The tests aimed to evaluate the total runtime of a variety of SQL queries using Spark3 running on YARN while leveraging different Amazon S3 storage classes. We executed all selected queries under the same YARN application context to ensure consistency. The steps were as follows:
Generate Parquet format data of size 1TB in S3 Standard and S3 Express One Zone
Create tables on top of the data and compute table/column statistics
Execute all TPC-DS read-only queries - 102 queries tested (see the sketch below)
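For illustration, running every query file through a single spark-sql session keeps all of them inside one YARN application, as described above. The query directory and warehouse bucket below are placeholders.

# Illustrative only: concatenate the 102 read-only TPC-DS query files and run
# them in one spark-sql session so they share a single YARN application.
WAREHOUSE=s3a://my-tpcds-bucket/warehouse   # S3 Standard or S3 Express One Zone
cat tpcds-queries/q*.sql > all_queries.sql
spark-sql \
  --conf spark.sql.warehouse.dir="${WAREHOUSE}" \
  -f all_queries.sql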
Results and Analysis
Key Observations:
The S3 Express One Zone storage class generally provided better performance, with lower execution times for most queries, benefiting from its single-Availability Zone design, which co-locates storage and compute for lower-latency access.
S3 Standard demonstrated longer runtimes compared to S3 Express One Zone, which is expected given its higher redundancy and durability.
Several individual queries showed notable differences in runtime performance between the two storage classes.
Cost Efficiency and Performance Correlation
S3 Express One Zone is up to 50% cheaper in request costs and demonstrated a 38% reduction in average runtime compared to S3 Standard. It provides high-performance, single-Availability Zone storage that delivers consistent single-digit-millisecond access. Co-locating storage and compute in the same Availability Zone reduces latency, leading to faster workloads and lower compute resource usage. S3 Express One Zone is the ideal choice for non-critical workloads where cost efficiency, low latency, and high performance are key priorities.
The combination of lower storage costs and faster execution times makes S3 Express One Zone a compelling option for use cases where data availability is not the highest priority. However, for mission-critical workloads that require high availability and resilience, S3 Standard may still be preferable despite the higher cost and longer runtimes due to its durability and redundancy.
Things to Consider When Choosing the Right S3 Storage Class
Data Availability Requirements: If your workloads need high availability and redundancy, S3 Standard is a reliable choice, ensuring your data remains accessible even in the event of hardware failure.
Cost Considerations: For test data, non-critical applications, or data that can tolerate lower availability, S3 Express One Zone provides a more cost-effective option.
Performance Sensitivity: Performance-sensitive Spark workloads may benefit from S3 Express One Zone, which demonstrated faster runtimes in our tests.
Workload Nature: For batch processing jobs or non-production use cases where cost efficiency is more important than availability, S3 Express One Zone is the ideal choice.
When to use S3 Express One Zone
S3 Express One Zone is ideal for:
Workloads where cost efficiency is a higher priority than high availability, such as test and development environments.
Applications that require low latency and faster data access, such as video streaming or financial simulations. By co-locating storage and computing in the same availability zone, S3 Express One Zone reduces latency and ensures faster data processing.
Scenarios where durability and high availability are not critical. This includes batch jobs or workloads that can tolerate occasional data unavailability, focusing instead on cost efficiency and performance.
S3 Express One Zone is supported only for Data Hub (compute) Clusters and is not currently available for Cloudera Data Services. Additionally, replacing the default Amazon S3 storage for Datalake with S3 Express One Zone is not supported.
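For context, S3 Express One Zone data lives in a "directory bucket" that is pinned to a single Availability Zone at creation time. The following is a sketch of creating one with the AWS CLI; the bucket base name and the Availability Zone ID (use1-az4 here) are placeholders for your environment.

# Illustrative only: create an S3 Express One Zone directory bucket.
# Directory bucket names must end with --<az-id>--x-s3.
aws s3api create-bucket \
  --bucket spark-bench-data--use1-az4--x-s3 \
  --region us-east-1 \
  --create-bucket-configuration \
    'Location={Type=AvailabilityZone,Name=use1-az4},Bucket={DataRedundancy=SingleAvailabilityZone,Type=Directory}'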
When to use S3 Standard
S3 Standard is ideal for:
Mission-critical workloads that require high durability and availability across multiple availability zones.
Customer-facing applications or production workloads where data availability is crucial to maintaining Service-Level Agreements (SLAs).
Use cases involving regulatory or compliance requirements that dictate data redundancy across multiple locations.
Data lake storage for long-term, high-availability storage of critical business data that must be accessed across multiple applications.
Backup and disaster recovery solutions where durability and access in the event of failure are critical to ensure business continuity.
Conclusion
In this analysis, the S3 Express One Zone storage class consistently provided better performance for Spark3 on YARN workloads, making it suitable for data processing workloads that can tolerate low redundancy but require fast execution. On the other hand, S3 Standard offers increased durability and availability, which may be necessary for mission-critical workloads. The right choice ultimately depends on the specific workload requirements and the balance between cost, availability, and performance. We hope this comparison helps you make informed decisions about leveraging the best Amazon S3 storage option for your Spark workloads in the cloud.
Visit the product page to learn more about usage and steps to update the configuration or reach out to your account team. Additionally, start your free 5-day trial of Cloudera's public cloud services to experience the platform firsthand.
01-10-2023 09:25 AM
Hello Prakodi,

We do not have a specific script to export all of the table DDLs for a schema in Hive. However, what you are looking for can be achieved with the command SHOW CREATE TABLE <HIVE_Table_name>; wrapped in a shell script that first retrieves all the table names and then runs SHOW CREATE TABLE for each table in the schema (a sketch of such a script is below). I did a search online and found an external blog article that does the same; please refer to https://dwgeek.com/export-hive-table-ddl-syntax-and-shell-script-example.html/

Let us know if you have any questions.

- Varun
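Here is a rough sketch of that approach, assuming Beeline connectivity to HiveServer2. The JDBC URL and schema name are placeholders you would replace for your environment.

#!/bin/bash
# Sketch: export the DDL of every table in a Hive schema to one file.
# Placeholders: JDBC_URL and SCHEMA.
JDBC_URL="jdbc:hive2://hs2-host:10000/default"
SCHEMA="my_schema"
OUT="${SCHEMA}_ddl.sql"

> "${OUT}"
# List all tables in the schema, then emit SHOW CREATE TABLE for each one.
for t in $(beeline -u "${JDBC_URL}" --silent=true --showHeader=false \
             --outputformat=tsv2 -e "SHOW TABLES IN ${SCHEMA};"); do
  echo "-- DDL for ${SCHEMA}.${t}" >> "${OUT}"
  beeline -u "${JDBC_URL}" --silent=true --showHeader=false \
    --outputformat=tsv2 -e "SHOW CREATE TABLE ${SCHEMA}.${t};" >> "${OUT}"
  echo ";" >> "${OUT}"
done
echo "Wrote DDL for all tables in ${SCHEMA} to ${OUT}"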
09-05-2019 12:26 PM
Hello @neron ,

Based on the error messages, it looks like you have not flushed the iptables, which is a requirement for setting up CDSW. You can find more information here: https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_requirements_supported_versions.html#networking_security_req

Flush iptables, stop firewalld, reset weave, and restart the host:

1) Stop all the instances: CM > CDSW > Instances > All > Stop

2) SSH to the master node and clear all iptables rules:

iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X

3) Stop the firewalld service and disable it at boot time:

systemctl stop firewalld
systemctl disable firewalld

4) From CM, selectively start only the Docker instance (this is important for carrying out the next step).

5) Submit the following command to reset weave:

/opt/cloudera/parcels/CDSW/cni/bin/weave reset --force

6) Stop the Docker role that we started in step 4.

7) Restart the host:

init 6

8) Start the Docker and Master roles to ensure all pods come up fine. If yes, start the Application role.

I hope this will be helpful.