Community Articles

Cloudera Employee

Cloudera on cloud offers a scalable, flexible, and cost-efficient solution for data management and analytics in the public cloud. It integrates data applications into a unified platform, enabling seamless data consolidation, advanced analytics, and machine learning. With its pay-as-you-go pricing model and managed services, Cloudera on cloud optimizes costs and simplifies operations. It enhances security and compliance, supports hybrid and multi-cloud environments, and provides tools for real-time data processing and self-service analytics. Cloudera on cloud empowers organizations to manage data efficiently while leveraging the latest cloud technologies.

In comparison to Cloudera on premises, costs in public cloud can escalate quickly if not carefully monitored, a common challenge with all cloud workloads. However, with effective tracking and cost optimization strategies, public cloud adoptions can offer significant benefits in performance, scalability and cost savings over traditional data center workloads. 

Cost optimization should be an ongoing initiative for every organization to fully realize the financial benefits of cloud. This blog will focus on the AWS cloud resources used by Cloudera on cloud deployments, providing insights on how to monitor and optimize these cloud resources to reduce AWS costs.

To effectively monitor and track spending across all services in your AWS account, leverage cloud financial management tools such as AWS Cost Explorer and AWS Cost Anomaly Detection for detailed cost analysis. AWS Budgets can assist with planning and cost controls, while Trusted Advisor provides recommendations to optimize your infrastructure. Together, these tools help you enhance security, improve performance, reduce unnecessary costs, and keep service usage within set quotas, leading to more efficient and cost-effective cloud operations.
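As a rough sketch of how such tracking can start, the snippet below builds the request body for Cost Explorer's GetCostAndUsage API (the shape accepted by boto3's `ce_client.get_cost_and_usage(**params)`), covering the last 30 days of unblended cost grouped by service. The function name and the specific date are illustrative, not part of any Cloudera or AWS tool:

```python
from datetime import date, timedelta

def monthly_cost_by_service_request(today: date) -> dict:
    """Build a GetCostAndUsage request: last 30 days of unblended
    cost, one bucket per AWS service (EC2, S3, RDS, ...)."""
    start = today - timedelta(days=30)
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": today.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }

params = monthly_cost_by_service_request(date(2025, 3, 21))
print(params["TimePeriod"])  # {'Start': '2025-02-19', 'End': '2025-03-21'}
```

Feeding these parameters to the Cost Explorer client returns per-service cost lines you can chart or alert on, which is the raw material for every optimization discussed below.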

 

AWS Cloud Resources used by Cloudera

The following AWS resources are used by Cloudera on Cloud services.

  • VPC components: VPC, subnets, route tables and security groups.
  • Compute: EC2 instances (for Cloudera Data Hub clusters, FreeIPA, etc.), EKS (for Data Services) and EBS volumes.
  • Storage: S3 buckets for data storage, logging, and backups, EFS.
  • Networking: Elastic IPs, VPC endpoints, Private links, Elastic Load Balancers, Internet and NAT gateways.
  • Security: IAM roles and policies for managing permissions, KMS.
  • Monitoring: CloudWatch for logs and metrics.
  • Database: RDS instances for managing metadata.

The primary cost drivers for AWS resources used by Cloudera on cloud include EC2, S3, RDS, and data transfer fees. This blog focuses on these key services, as they have the most significant impact on overall AWS costs. We will also examine financial levers you can use to further reduce AWS costs.

 

Cost Optimization on Amazon Elastic Compute Cloud (EC2)

Cloudera supports a wide range of AWS EC2 instance types, including general purpose, compute-optimized, memory-optimized, storage-optimized, and GPU-optimized instances. It is crucial to optimize EC2 instance costs while preserving the required performance and availability for your applications. Consider the following strategies for optimizing EC2 instance costs.

Choose the Right Instance Type and Size

Match your workloads with the appropriate instance type and family—such as using memory-optimized instances (e.g., r-type) for memory-intensive tasks like Spark workloads, and general-purpose instances (e.g., m-type) for data lake and FreeIPA clusters. 

Leverage Cloudera’s Grafana tools and Cloudera Manager to avoid over-provisioning by analyzing usage patterns and adjusting instance sizes as needed. Additionally, AWS services such as CloudWatch, Trusted Advisor, and Compute Optimizer can help monitor instance performance, set alerts for underutilized resources, and provide recommendations for instance types and sizes based on your usage.
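A minimal sketch of the "flag underutilized resources" step: given average-CPU samples per instance (for example, hourly CloudWatch CPUUtilization datapoints exported from your monitoring stack), pick out instances that stay below a threshold. The record shape and thresholds are assumptions for illustration:

```python
def underutilized(instances, cpu_threshold=10.0, min_samples=24):
    """Return IDs of instances whose average CPU stays below the
    threshold across enough samples -- rightsizing or shutdown
    candidates, not an automatic verdict."""
    flagged = []
    for inst in instances:
        samples = inst["cpu_samples"]
        if len(samples) >= min_samples and sum(samples) / len(samples) < cpu_threshold:
            flagged.append(inst["id"])
    return flagged

# Hypothetical fleet: two days of hourly averages per instance.
fleet = [
    {"id": "i-master", "cpu_samples": [55.0] * 48},
    {"id": "i-idle-worker", "cpu_samples": [3.0] * 48},
]
print(underutilized(fleet))  # ['i-idle-worker']
```

In practice you would review flagged instances against Cloudera Manager charts before downsizing, since CPU alone can miss memory- or I/O-bound workloads.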

Opt for the latest generation of instances, as they typically offer improved performance and cost efficiency compared to older models.


Utilize Reserved Instances (RIs) and Savings Plans

Reserved Instances and Savings Plans are pricing options offered by AWS that provide cost savings compared to on-demand pricing, with savings of up to 75% depending on the type, size, and duration of the commitment. Commitments run for either one or three years; you can also choose how to pay for the commitment, with an all-upfront payment yielding the largest discounts.

  • Compute Savings Plans: Commit to a certain amount of compute usage over a one- or three-year term. This plan offers flexibility across instance types and regions.
  • EC2 Instance Savings Plans: Provide lower prices in exchange for a commitment to a specific instance family in a region.

While both Reserved Instances (RIs) and Savings Plans offer cost savings, Savings Plans provide greater flexibility. This flexibility allows you to use various instance types, sizes, and regions, and even other AWS services such as Lambda and Fargate, while still benefiting from cost reductions. In contrast, Reserved Instances are tied to specific instance types and families.

When purchasing a Reserved Instance (RI) or Savings Plan, you commit to paying for full 24-hour usage each day. To optimize the financial benefits of this commitment, it's essential to ensure that instances are in continuous use by evenly distributing workloads throughout the day. Alternatively, you can strategically schedule additional workloads during periods when your Cloudera instances are not in use. 
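The arithmetic behind that advice can be sketched as a break-even calculation: a commitment is paid for all 8,760 hours a year, so it only wins over on-demand above a certain utilization. The prices below are hypothetical placeholders, not current AWS list prices:

```python
HOURS_PER_YEAR = 8760

def annual_cost_on_demand(hourly_rate, utilization):
    """On-demand: pay only for the fraction of hours actually used."""
    return hourly_rate * HOURS_PER_YEAR * utilization

def annual_cost_reserved(upfront, hourly_commit=0.0):
    """All-upfront RI / Savings Plan: the commitment is owed regardless of use."""
    return upfront + hourly_commit * HOURS_PER_YEAR

def breakeven_utilization(od_rate, upfront, hourly_commit=0.0):
    """Utilization above which the commitment beats on-demand."""
    return annual_cost_reserved(upfront, hourly_commit) / annual_cost_on_demand(od_rate, 1.0)

# Illustrative numbers: $0.192/hr on-demand vs a $1,009 all-upfront 1-year RI.
od_rate, ri_upfront = 0.192, 1009.0
print(round(breakeven_utilization(od_rate, ri_upfront), 2))  # 0.6
```

In this made-up example the RI pays off only above roughly 60% utilization, which is why spreading workloads across the day, or backfilling idle hours, matters so much for committed capacity.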

Use Spot Instances

Cloudera supports the use of Spot Instances for Data Services such as Cloudera Data Engineering, saving up to 90% compared to on-demand instances. Spot Instances are ideal for flexible, interruptible workloads where cost optimization is critical. When configuring a virtual cluster in Cloudera Data Engineering with Spot Instances enabled, you can choose whether drivers and executors run on spot or on-demand instances. By default, drivers use on-demand instances, while executors use Spot Instances. As Spot Instances could get terminated unexpectedly, affecting job performance, Cloudera advises using them only for workloads without strict SLA requirements.

For development, testing, and other workloads with flexible needs, Cloudera suggests using on-demand or Reserved (based on hours of usage) instances for drivers and Spot Instances for executors. This approach minimizes the impact of spot instance terminations, as executor failures are generally easier to recover from than driver failures. You cannot enable spot instances on an existing Cloudera Data Engineering service.
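The savings from that driver/executor split can be estimated with simple blended-cost arithmetic. The rates below are hypothetical (on-demand $0.20/hr, spot at a 70% discount); actual spot discounts vary by instance type and availability zone:

```python
def blended_job_cost(driver_hours, executor_hours, od_rate, spot_rate,
                     spot_executors=True):
    """Cost of a Spark job where the driver always runs on-demand and
    executors optionally run on spot, mirroring the recommended split."""
    driver_cost = driver_hours * od_rate
    exec_rate = spot_rate if spot_executors else od_rate
    return driver_cost + executor_hours * exec_rate

# Hypothetical job: 2 driver-hours, 40 executor-hours.
all_od = blended_job_cost(2, 40, 0.20, 0.06, spot_executors=False)
mixed = blended_job_cost(2, 40, 0.20, 0.06)
print(round(all_od, 2), round(mixed, 2))  # 8.4 2.8
```

Because executor-hours dominate most Spark jobs, moving only the executors to spot captures most of the discount while keeping the failure-sensitive driver on stable capacity.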


Implement Autoscaling

Cloudera supports autoscaling for Cloudera Data Hub (YARN-based templates) and all of its data services, including Cloudera Data Engineering, Cloudera DataFlow, Cloudera Data Warehouse, Cloudera AI, and Cloudera Operational Database. Each of these services implements its own autoscaling strategy.

  • Cloudera Data Hub autoscaling
  • Cloudera Data Engineering autoscaling
  • Cloudera DataFlow autoscaling
  • Cloudera Data Warehouse autoscaling
  • Cloudera AI autoscaling
  • Cloudera Operational Database autoscaling

Implement Lifecycle, Autosuspend and Termination Policies

Cloudera Data Services support autosuspend for virtual warehouses, which releases the remaining resources once the autoscaler has scaled back to the last executor group.

Also consider establishing lifecycle policies to automatically terminate or deallocate instances that are no longer in use, such as temporary or test environments. Put company-wide policies or scheduled scripts (for example, weekly) in place to terminate non-production instances regularly, unless specific exceptions apply. This proactive approach can significantly reduce unnecessary costs by ensuring that unused instances do not remain active beyond their intended lifespan.

Additionally, monitor unused EBS volumes. Regularly conduct audits and delete any EBS volumes that are no longer needed to further optimize costs.
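One piece of such an audit can be sketched as a filter over volume records shaped like the EC2 DescribeVolumes response (the sample data here is made up): volumes in the `available` state exist, and are billed, but are attached to nothing.

```python
def unattached_volumes(volumes):
    """From records shaped like EC2 DescribeVolumes output, pick
    volumes in the 'available' state with no attachments -- the
    usual snapshot-and-delete cleanup candidates."""
    return [v["VolumeId"] for v in volumes
            if v["State"] == "available" and not v.get("Attachments")]

sample = [
    {"VolumeId": "vol-attached", "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-orphan", "State": "available", "Attachments": []},
]
print(unattached_volumes(sample))  # ['vol-orphan']
```

A scheduled job that runs this filter and tags (or, after review, deletes) the results keeps orphaned volumes from quietly accumulating storage charges.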


 

Cost Optimization on Amazon Simple Storage Service (S3)

Cloudera on cloud leverages Amazon S3 for its scalable and durable storage capabilities, making it suitable for managing large volumes of data with high availability. By integrating with S3, Cloudera enables you to optimize storage costs through a pay-as-you-go pricing model, ensuring that resources are utilized efficiently based on usage patterns. This integration provides a robust foundation for data lakes, backups, and various data services, enabling organizations to effectively manage and analyze their data while ensuring security, scalability and reliability.

Like all cloud services, S3 costs can escalate quickly if not monitored and optimized periodically. Cloudera enables you to tier, manage, and transition your data across different storage levels, enabling you to effectively leverage S3's cost optimization capabilities. For instance, in data lakes, organizations often store data that may not be frequently accessed or required for daily operations; such data can be moved to a more cost-effective storage tier. Archiving data from a standard tier to a colder tier, such as Amazon S3 Glacier, can lead to savings of up to 80% on storage costs, making it a strategic choice for long-term data retention.

Choose the Right Storage Class

Cloudera supports all S3 storage classes, which are tailored for various use cases. Selecting the appropriate storage class can enhance performance and significantly reduce costs. Key considerations when choosing a storage class:

  • For predictable access patterns, you can implement lifecycle policies to transition data to a lower-cost storage class optimized for infrequent or archival access as usage decreases. To effectively manage predictable workloads, utilize S3 Storage Class Analysis to monitor access patterns across objects and determine the optimal timing for shifting data to the appropriate storage class, thereby maximizing cost efficiency.
  • For unpredictable access patterns, such as data lake and analytics use cases, access can fluctuate throughout the year, ranging from minimal to frequent usage; storing such data in infrequent-access or archive classes can lead to high retrieval costs. For these workloads, consider S3 Intelligent-Tiering, which automatically optimizes costs as access patterns change, enhancing the efficiency of your Cloudera platform.
  • Choosing a lower-cost storage class does not automatically reduce S3 expenses. In addition to storage fees, charges may apply for data retrieval, minimum storage durations, and API requests associated with specific classes. For example, while S3 Glacier Instant Retrieval is ideal for data accessed roughly once a quarter, its lower storage price comes with higher retrieval costs. For data retrieved more frequently, S3 Standard-IA (with lower retrieval charges) or S3 Standard (with none) may be more economical. S3 Glacier Instant Retrieval also has a minimum storage duration of 90 days; if you delete data before this period ends, you incur a prorated early-deletion fee. S3 Standard or S3 Standard-IA are better choices in such cases.
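The trade-off in that last bullet is easy to see with a toy monthly-cost comparison. The per-GB prices below are illustrative placeholders, not current AWS pricing, and request fees are omitted for brevity:

```python
def monthly_s3_cost(gb_stored, gb_retrieved, storage_price, retrieval_price):
    """Monthly cost = storage fees + per-GB retrieval fees."""
    return gb_stored * storage_price + gb_retrieved * retrieval_price

# Hypothetical (storage $/GB-month, retrieval $/GB) pairs per class:
classes = {
    "STANDARD":    (0.023, 0.0),
    "STANDARD_IA": (0.0125, 0.01),
    "GLACIER_IR":  (0.004, 0.03),
}

gb_stored, gb_retrieved = 10_000, 5_000  # a retrieval-heavy workload
for name, (store_p, retrieve_p) in classes.items():
    print(name, round(monthly_s3_cost(gb_stored, gb_retrieved, store_p, retrieve_p), 2))
```

With half the data retrieved each month, the "cheapest" storage class (Glacier Instant Retrieval) ends up costing more than Standard-IA in this example, which is exactly why retrieval volume must be part of the decision.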

If you know data will be accessed no more than once a quarter, upload it directly to S3 Glacier to avoid transition fees. However, note that Cloudera does not currently support reading or retrieving data from Glacier; to use the data, it must first be restored to one of the supported storage tiers.

Implement S3 Lifecycle Policies

A lifecycle policy automatically transitions objects in your bucket from one storage class to another, which is crucial for efficient large-scale data processing. In this context, different layers of data—such as source, raw, processed, and business data—serve distinct functions. Once the data is processed, the source and raw layers are seldom accessed, making it cost-effective to move them to infrequent or archive storage classes. Implementing lifecycle policies can streamline this process, minimizing manual cleanup and associated costs.

You can create lifecycle rules to automatically transition objects to the Standard-IA storage class, archive them to Glacier, or remove them after a specified period. The costs associated with lifecycle transitions correlate directly with the number of objects moved, so consider aggregating or compressing files to reduce this number before archiving.

Apply lifecycle policies to manage datasets that are no longer needed, and establish policies to clean up incomplete multipart uploads. Organizing your data can further simplify policy implementation; for instance, grouping all source data under a single prefix allows for a unified lifecycle policy rather than managing multiple ones. Use lifecycle policies to delete unnecessary objects rather than calling the delete API, which helps mitigate delete-request costs. Finally, if your bucket is versioned, ensure your rules include actions for both current and non-current object versions to transition or expire them as needed.
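Pulling those points together, here is a sketch of such a rule as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The prefix, day counts, and rule ID are illustrative assumptions for a "source layer under one prefix" layout:

```python
def source_layer_lifecycle(prefix="source/", ia_days=30, glacier_days=90,
                           expire_days=365):
    """Tier the rarely re-read source layer to Standard-IA, then
    Glacier, then expire it; also abort stale multipart uploads."""
    return {"Rules": [{
        "ID": f"tier-and-expire-{prefix.rstrip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_days},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]}

cfg = source_layer_lifecycle()
print(cfg["Rules"][0]["Transitions"][1]["StorageClass"])  # GLACIER
```

For a versioned bucket you would extend the same rule with `NoncurrentVersionTransitions` and `NoncurrentVersionExpiration` actions so old versions follow the same tiering path.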

Choose Appropriate Server-side Encryption Techniques for S3 Data 

Cloudera provides robust support for encryption to ensure data security and compliance. It integrates seamlessly with AWS encryption services, offering multiple options for data protection.

You can leverage Amazon S3's server-side encryption methods, such as Amazon S3-managed keys (SSE-S3), AWS KMS keys stored in AWS Key Management Service (SSE-KMS), or customer-provided keys (SSE-C), to automatically encrypt data at rest.

SSE-KMS is intended for users who require enhanced control over their encryption keys. It’s crucial to note that each request made to Amazon S3 that utilizes KMS incurs a KMS fee along with regular S3 charges, which can lead to escalating costs when dealing with large queries or a high number of GET requests. Additionally, your requests may experience throttling due to KMS limits. To alleviate these costs and prevent throttling, enabling Amazon S3 bucket keys is recommended. Bucket keys significantly reduce the number of transactions between Amazon S3 and AWS KMS, potentially reducing KMS costs by up to 99%.
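A back-of-the-envelope model shows why bucket keys matter at scale. The per-10k-request price and the 99% cache-hit figure below are illustrative assumptions (the hit rate mirrors AWS's "up to 99%" claim; actual savings depend on your request mix):

```python
def kms_request_cost(s3_requests, price_per_10k=0.03, bucket_key_hit_rate=0.0):
    """Rough KMS bill: without bucket keys every encrypted S3 request
    triggers a KMS call; with bucket keys most requests reuse the
    cached bucket-level key and never reach KMS."""
    kms_calls = s3_requests * (1 - bucket_key_hit_rate)
    return kms_calls / 10_000 * price_per_10k

monthly_requests = 100_000_000  # hypothetical 100M requests/month
print(round(kms_request_cost(monthly_requests), 2))                          # 300.0
print(round(kms_request_cost(monthly_requests, bucket_key_hit_rate=0.99), 2))  # 3.0
```

The same reduction in KMS calls also eases the request-rate throttling risk mentioned above, since far fewer calls count against KMS quotas.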


 

SSE-S3 is the default server-side encryption option for Amazon S3, and there are no additional fees for using it. Each object is encrypted with a unique key, which is itself encrypted with a root key that Amazon S3 regularly rotates.

SSE-C allows you to control your encryption key while Amazon S3 handles the encryption during data writing and decryption when you access your objects.

Cloudera supports encryption protocols such as TLS/SSL for securing data in transit, safeguarding data exchanges between clients and Cloudera services.

Leverage S3 Management Tools for Enhanced Efficiency 

Utilize Amazon S3 Storage Lens to gain valuable insights into storage utilization and activity trends at both the organizational and account levels. This tool provides detailed analysis by AWS Region, Storage Lens groups, or prefixes, enabling you to uncover anomalies and identify opportunities for cost efficiencies.


S3 Batch Operations is a managed service that lets you efficiently manage billions of objects at scale. With it, you can change object metadata and properties, copy or replicate objects between buckets, replace object tag sets, modify access controls, and restore archived objects from Amazon S3 Glacier, all without writing code. There is no need to set up servers, partition workloads, or worry about S3 throttling; AWS handles these tasks, so you avoid provisioning and paying for your own compute.

 

Cost Optimization on RDS 

Several services within Cloudera on cloud, such as the Data Lake cluster and Cloudera Data Hub cluster templates, require a relational database. Typically, these databases are external and provisioned during the initial deployment of the respective service, with AWS RDS instances being created in this context.

Utilize Reserved Instances: RDS Reserved Instances provide substantial savings compared to On-Demand pricing, particularly for one- or three-year terms. As of this writing, AWS does not currently offer a compute savings plan for RDS.

Implement Effective Backup and Snapshot Policies:
  • Backup Retention: Tailor your backup retention period to align with your recovery requirements. Longer retention durations can lead to increased storage costs.
  • Automated Snapshots: Regularly assess and manage your automated snapshots. Ensure you delete unnecessary backups or snapshots that are no longer needed to optimize storage costs.
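The snapshot-review step above can be sketched as a retention filter over snapshot records (the record shape and sample data are made up for illustration; real records would come from RDS's DescribeDBSnapshots output):

```python
from datetime import date, timedelta

def expired_snapshots(snapshots, retention_days, today):
    """Pick snapshots older than the retention window -- candidates
    for deletion to cap RDS backup storage costs (review before
    deleting anything needed for compliance)."""
    cutoff = today - timedelta(days=retention_days)
    return [s["id"] for s in snapshots if s["created"] < cutoff]

snaps = [
    {"id": "snap-fresh", "created": date(2025, 3, 10)},
    {"id": "snap-stale", "created": date(2024, 12, 1)},
]
print(expired_snapshots(snaps, retention_days=35, today=date(2025, 3, 21)))
# ['snap-stale']
```

Running such a filter on a schedule, with the retention window matched to your recovery requirements, keeps manual snapshots from accumulating indefinitely.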

 

Manage and Optimize Data Transfer Costs

Optimizing data transfer costs is essential for efficient and cost-effective operations when running Cloudera on AWS. AWS charges for data transfers between regions, between availability zones, and to the internet, so workloads should be architected strategically. By combining AWS-native cost optimization features with Cloudera-specific best practices, you can manage transfer costs while maintaining performance and scalability. For example, consider deploying Cloudera services within a single AZ for non-production workloads to avoid inter-AZ transfer fees.

Optimize Data Transfer within AWS: Keep Cloudera clusters and data sources in the same region whenever possible to avoid inter-region data transfer fees. 

For VPC-to-VPC communication, two common options are VPC peering and AWS Transit Gateways, each with distinct advantages and trade-offs. 

VPC peering creates a direct, private connection between two VPCs. It incurs cross-AZ data charges and requires a separate peering connection for each VPC pair to scale across many VPCs, which can increase network complexity in large environments.

AWS Transit Gateway provides a centralized routing hub that simplifies connectivity across multiple VPCs and on-premises networks. Unlike VPC Peering, which requires individual connections between each VPC, Transit Gateway enables many-to-many communication through a single attachment per VPC. However, it incurs per-hour and per-GB data processing charges, making it more suitable for large-scale architectures that require streamlined routing and network segmentation.

Choosing between VPC Peering and Transit Gateway depends on the scale, complexity, and cost considerations of the network architecture. In some cases, a hybrid approach, combining both VPC Peering and Transit Gateway, may offer the best solution based on specific data traffic patterns and architectural requirements.

Reducing Egress Costs: While data ingress into AWS is free, outbound transfers are charged per GB. To reduce transfer costs, compress data before transfer using formats like Parquet, ORC, or Snappy to reduce data transfer size. For hybrid cloud architectures, consider using AWS Direct Connect, which offers lower-cost, high-bandwidth connections compared to traditional internet-based transfers, making it a cost-effective solution for large data movements.
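The compression effect on the egress bill is simple arithmetic. The per-GB rate and the 3x compression ratio below are illustrative assumptions; actual ratios for Parquet/ORC with Snappy depend heavily on the data:

```python
def egress_cost(gb_out, per_gb=0.09, compression_ratio=1.0):
    """Egress bill for gb_out of logical data; compression_ratio is
    logical size / wire size (1.0 means uncompressed)."""
    return gb_out / compression_ratio * per_gb

raw = egress_cost(5000)                           # 5 TB, uncompressed
packed = egress_cost(5000, compression_ratio=3.0) # same data, ~3x compressed
print(round(raw, 2), round(packed, 2))  # 450.0 150.0
```

Because egress is billed on bytes that actually leave AWS, every gain in compression ratio translates one-for-one into a smaller transfer bill.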

 

Cost Optimization using AWS Financial Levers: Enterprise Discount Programs and Migration Acceleration Programs

Enterprise Discount Program (EDP): All Cloudera deployments and resources are managed within your AWS account, allowing you to take advantage of discount programs such as the AWS EDP. This program is specifically designed for AWS account owners who commit to substantial, long-term cloud spending. The EDP aims to help organizations achieve sustainable economies of scale, enabling them to derive greater value from their AWS investments as they expand their operations in the cloud. This initiative is particularly beneficial for finance and engineering teams within Cloudera-using organizations, as they seek to optimize their cloud expenditures and enhance overall cost efficiency. 

Migration Acceleration Program (MAP): When migrating existing workloads from on-premises environments or other cloud providers to AWS, you can benefit from AWS MAP initiatives. This comprehensive program is designed to facilitate smooth transitions to the cloud, whether from Cloudera on premises or Cloudera Distribution including Apache Hadoop (CDH). MAP can provide discounts of up to 30% on AWS infrastructure costs for a designated period, based on the resources consumed by the new workloads. This initiative not only eases the migration process but also enhances cost efficiency, making it a valuable option for organizations leveraging Cloudera solutions.

 

Conclusion

Optimizing AWS spend with Cloudera on cloud is an ongoing process that requires a strategic approach. To enhance cost efficiency, take advantage of cloud optimization tools and strategies offered by both Cloudera and AWS. By leveraging AWS services like EC2, S3, and RDS, along with Cloudera’s autoscaling and monitoring capabilities, you can reduce unnecessary expenses while maintaining performance. Using Reserved Instances, Savings Plans, and AWS financial programs, businesses can further optimize cloud expenditures while ensuring data workloads remain secure and scalable.

Last update: 03-21-2025