
GrazittiAPI · ‎03-13-2026

The conversation around cloud adoption has matured significantly. It's no longer a question of if enterprises should use the cloud, but how they can strategically blend public cloud agility with the security and control of their on-premises infrastructure. This hybrid approach is now the dominant strategy for modern data-driven organisations.

Cloud Modernisation Strategies - Migration vs Bursting

Organisations employ two complementary strategies for hybrid cloud adoption: workload migration to cloud and cloud bursting.

While traditional migration involves the permanent relocation of applications and datasets to the cloud for modernisation, cloud bursting dynamically extends a private data center into a public cloud. This provides temporary, on-demand compute to handle demand spikes, scaling back down as capacity needs subside.

These two strategies co-exist. Migration is a long-term approach for modernising to cloud-native workloads, whereas bursting provides immediate compute elasticity for workloads that are retained on-premises, bypassing physical hardware procurement cycles.
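
To make the bursting idea concrete, here is a minimal sketch in Python of how overflow demand might be translated into temporary cloud capacity. The function name `burst_nodes`, the core counts, and the hourly demand series are all illustrative assumptions, not part of any Cloudera API or sizing guidance.

```python
import math


def burst_nodes(demand_cores: int, onprem_cores: int, cores_per_cloud_node: int) -> int:
    """Return how many cloud nodes are needed to cover demand beyond on-premises capacity."""
    if cores_per_cloud_node <= 0:
        raise ValueError("cores_per_cloud_node must be positive")
    # Only the demand that exceeds on-premises capacity is sent to the cloud.
    overflow = max(0, demand_cores - onprem_cores)
    return math.ceil(overflow / cores_per_cloud_node)


# Hypothetical demand (in cores) sampled over six intervals, with one spike.
demand = [400, 450, 900, 1300, 700, 420]
plan = [burst_nodes(d, onprem_cores=800, cores_per_cloud_node=32) for d in demand]
print(plan)  # prints [0, 0, 4, 16, 0, 0]
```

Cloud capacity appears only in the intervals where demand exceeds the on-premises core count and drops back to zero afterwards, which is the scale-up and scale-down behaviour described above.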

Operational Challenges in a Hybrid Estate

Building and operating a hybrid estate introduces its own significant operational challenges.

Simply connecting an on-premises data center to a public cloud doesn't create a true hybrid platform. Without a unified strategy, organisations quickly face:

Fragmented Management & Rising Costs: Teams are often forced to use disparate tools and skillsets for different environments, leading to fragmented management, a lack of unified visibility, decentralised cost tracking, and budget overruns.
Overheads in Maintaining Data Copies: Replicating data to the cloud increases costs and creates data staleness. This also complicates access control and auditing, heightening the risk of data leaving regulated boundaries.
Suboptimal Workload Migration: Workloads often require significant re-engineering to function in each new environment. This negates the agility the cloud is intended to provide and prevents a central view for capacity planning.

To solve these problems, a platform must be built on a truly hybrid-native foundation.

Cloudera's Four Tenets of a True Hybrid Platform

At Cloudera, we believe a true hybrid cloud platform must deliver a seamless, unified experience. Our strategy is built on four key tenets:

Unified Runtime: Ensure true workload portability without any rewrites, allowing applications to work and feel the same everywhere.
Hybrid Environments: Provide in-place access to on-premises datasets from cloud deployments (AWS, Azure, and GCP), so that workloads can move between form factors without data replication.
Hybrid Control Plane: Offer a single pane of glass for managing all private and public deployments.
Data Security: Deliver centralised security and governance, hardened out of the box.

In this blog, we will focus on Hybrid Environments and Data Hub, and how they work to enable seamless extension of on-premises infrastructure to the cloud.

Architecture: Cloud Migration vs. Cloud Bursting

Before detailing Cloudera Hybrid Data Hubs, it is essential to note the contrast with a “lift-and-shift” cloud migration architecture.

Lift and Shift Architecture: Bring Data to the Cloud

In this model, data and metadata are replicated from the on-premises environment (like HDFS) to cloud storage (like Amazon S3). Processing is then done entirely in the cloud using the replicated data.

While this architecture is well suited to replication, applying it to ephemeral cloud bursting creates overhead from maintaining multiple data copies, adds complexity in ensuring data synchronization and consistency, and increases storage costs.

The New Hybrid Cloud Model: Bring Cloud to the Data

To natively enable cloud bursting, Cloudera is introducing Hybrid Environments and Data Hubs.

Cloudera Hybrid Environments and Data Hubs combine cloud-native elasticity, including provisioning and autoscaling, with a built-in capability to securely access datasets directly from an associated Cloudera on-premises cluster.

To put this into context, a workload (e.g., Spark) submitted to the Data Hub reads/writes data and metadata directly from the associated Cloudera on-premises cluster’s storage (e.g., HDFS), all authorised, audited, and governed by Cloudera SDX.

The Cloudera Hybrid Data Hub deployment architecture has the following building blocks:

Unified Authentication: Implementing a two-way Kerberos cross-realm trust between the cloud and on-premises clusters. This enables centralised authorisation and governance.
Default Workload Portability: Maintaining a unified Cloudera Runtime version across both the Hybrid Data Hub and the on-premises cluster, ensuring workloads can move without rewrite.
SDX Link: Associating the Cloudera on-premises cluster with the Hybrid Data Hub cluster to serve as its metadata, authorisation and governance context.
Network Connectivity: Ensuring stable, bi-directional connectivity between organisation-owned on-premises and cloud networks to support in-place data read/write operations for active jobs.
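
The four building blocks can be read as a pre-flight checklist. The sketch below, in Python, is one hypothetical way to express that checklist. The `Cluster` class, its field names, the realm names, and the `preflight` function are assumptions made for illustration; they do not correspond to a Cloudera API, and a real setup would verify each item with the platform's own tooling.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a cluster; all field names are assumptions."""
    name: str
    runtime_version: str
    kerberos_realm: str
    trusted_realms: frozenset = field(default_factory=frozenset)
    sdx_source: str = ""  # name of the cluster supplying metadata and policies


def preflight(onprem: Cluster, cloud: Cluster, network_ok: bool) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Unified authentication: the cross-realm trust must hold in both directions.
    if cloud.kerberos_realm not in onprem.trusted_realms:
        problems.append("on-premises cluster does not trust the cloud realm")
    if onprem.kerberos_realm not in cloud.trusted_realms:
        problems.append("cloud cluster does not trust the on-premises realm")
    # Workload portability: both sides run the same runtime version.
    if onprem.runtime_version != cloud.runtime_version:
        problems.append("runtime versions differ")
    # SDX link: the cloud cluster takes its governance context from on-premises.
    if cloud.sdx_source != onprem.name:
        problems.append("cloud cluster is not linked to the on-premises cluster")
    # Network connectivity: supplied by the caller, e.g. from a reachability probe.
    if not network_ok:
        problems.append("no bi-directional network connectivity")
    return problems


onprem = Cluster("dc1", "7.3.1", "ONPREM.EXAMPLE.COM",
                 frozenset({"CLOUD.EXAMPLE.COM"}))
cloud = Cluster("burst1", "7.3.1", "CLOUD.EXAMPLE.COM",
                frozenset({"ONPREM.EXAMPLE.COM"}), sdx_source="dc1")
print(preflight(onprem, cloud, network_ok=True))  # prints []
```

Returning a list of problems rather than a single boolean keeps each building block visible, so a failed check points directly at the item that needs attention.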

What Does Cloudera Hybrid Data Hub Unlock?

Hybrid Data Hub allows you to operate with the agility of the cloud while leveraging your existing infrastructure through the following key advantages.

Zero Data Migration: Eliminates the cost and complexity of application redesign and data migration. This enables dynamic workload movement rather than only planned workload migration.
Centralised Governance: All metadata, access permissions, and governance rules remain centralised on-premises and are enforced consistently, whether the workload runs on-premises or in the cloud.
Workload Portability without Rewrite: The Cloudera Unified Runtime (e.g., Cloudera 7.3.1) means applications work and feel the same everywhere without re-engineering.

In addition to being a native architecture for cloud bursting, Hybrid Data Hub also unlocks other applications for your business, including:

Strategic Workload Isolation: Maintain critical SLAs by offloading additional workloads to the cloud.
Accelerated Software Development: Create instant development environments that leverage zero-copy data access from your on-premises source.

Evaluating Zero Replication (In-place Data Access) Architecture for Bursting On-premises Spark Workloads to Cloud

We now move from theory to practice. While in-place data access eliminates the need for expensive and complex maintenance of persistent data copies for ephemeral cloud bursts, the performance varies based on infrastructure (such as network bandwidth and latency) and the specific workload profile.

We benchmarked Spark SQL workloads at enterprise scale on Hybrid Data Hubs to determine viability and to identify the infrastructure and workload factors that most influence performance.

Full text of the performance benchmark can be viewed here.

Performance Benchmarking Summary

The benchmarking exercise establishes how network bandwidth, file format, and compression settings affect performance in hybrid cloud environments where compute runs in the cloud and data remains on-premises.

Remote data access is a practical model for burst workloads, but performance is heavily impacted by available network bandwidth.
Columnar file formats (Parquet, ORC) drastically reduce execution time and data transfer compared to CSV, making them a prerequisite for hybrid setups.
Gzip compression significantly reduces data transfer volume, improving performance under limited bandwidth. Snappy offers minor gains with lower CPU overhead.
Bandwidth constraints on the cloud-to-data-center interconnect (e.g., AWS Direct Connect, Azure ExpressRoute) increase execution time and reduce CPU efficiency for I/O-intensive queries.
Not all queries are impacted equally: CPU-bound queries run efficiently even under constrained bandwidth, while I/O-bound queries degrade sharply.

Overall, the strategic use of columnar formats and compression enables many workloads to run efficiently in hybrid environments, even with limited network capacity.
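
A back-of-the-envelope calculation shows why format and compression matter so much when data is read across the interconnect. The Python sketch below is an assumption-laden estimate, not a reproduction of the benchmark: the function `transfer_seconds`, the 80% link efficiency, and the size ratios are illustrative values chosen for the example, and real ratios depend on the dataset, the columns a query reads, and the codec.

```python
def transfer_seconds(raw_gb: float, link_gbps: float,
                     size_ratio: float = 1.0, link_efficiency: float = 0.8) -> float:
    """Estimate the seconds needed to pull a dataset over the interconnect.

    raw_gb          -- uncompressed, row-oriented size in gigabytes
    link_gbps       -- nominal link bandwidth in gigabits per second
    size_ratio      -- bytes actually transferred divided by raw bytes
                       (an assumed figure covering format and compression)
    link_efficiency -- assumed fraction of nominal bandwidth that is usable
    """
    if link_gbps <= 0 or not 0 < link_efficiency <= 1:
        raise ValueError("link_gbps must be positive and link_efficiency in (0, 1]")
    gigabits = raw_gb * size_ratio * 8  # gigabytes to gigabits
    return gigabits / (link_gbps * link_efficiency)


# Illustrative comparison for 1,000 GB of raw data on a 10 Gbps link.
print(transfer_seconds(1000, 10))                   # prints 1000.0
print(transfer_seconds(1000, 10, size_ratio=0.25))  # prints 250.0
```

Under these assumptions, cutting the transferred bytes to a quarter cuts the transfer time to a quarter as well, which is why the same query can behave very differently depending on how the source data is stored.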

For CPU-intensive Spark jobs, this setup can be a viable architecture for burst-to-cloud use cases. In contrast, I/O-intensive jobs remain highly sensitive to network limits, making this approach less suitable for data-heavy pipelines without further optimisation.
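
One simple way to reason about which side of this divide a job falls on is to compare the time spent moving its input with the time spent computing on it. The following Python sketch is a rough heuristic under stated assumptions: the function name `classify_for_burst`, the threshold of 1.0, and the sample job figures are hypothetical and were not derived from the benchmark.

```python
def classify_for_burst(input_gb: float, compute_seconds: float,
                       link_gbps: float, threshold: float = 1.0) -> str:
    """Label a job 'cpu-bound' or 'io-bound' for remote data access.

    The label compares estimated network transfer time with compute time.
    A ratio at or above the threshold means the network is the bottleneck.
    """
    if compute_seconds <= 0 or link_gbps <= 0:
        raise ValueError("compute_seconds and link_gbps must be positive")
    transfer = input_gb * 8 / link_gbps  # seconds at nominal bandwidth
    ratio = transfer / compute_seconds
    return "io-bound" if ratio >= threshold else "cpu-bound"


# Hypothetical jobs on a 10 Gbps interconnect.
print(classify_for_burst(input_gb=50, compute_seconds=600, link_gbps=10))    # prints cpu-bound
print(classify_for_burst(input_gb=2000, compute_seconds=300, link_gbps=10))  # prints io-bound
```

A job labelled io-bound by this estimate is a candidate for the further optimisation mentioned above, such as a columnar format or compression, before it is burst to the cloud.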

Next Steps

Get started with Hybrid Data Hub setup to natively burst on-premises workloads to cloud without creating data copies or rewriting applications.

Cloudera Community