03-02-2026 01:17 PM
A few days back we co-hosted a Lakehouse meetup in New York City with Cloudflare and LanceDB, bringing together folks from the Apache Iceberg, Apache DataFusion, and Lance communities. At Cloudera, we view supporting and running open source meetups as a critical driver for the ecosystem itself, and we have done this for a while now. Open source projects like Iceberg, DataFusion, and Lance evolve through real-world feedback, and those conversations happen best when practitioners are in the same room. These meetups create that space, where design decisions are debated, trade-offs are weighed, and implementation realities are discussed openly. This event was a good example of that! We had three dedicated talks targeting these three projects and opened up the rest of the time for community networking.

Iceberg Spec Evolution - v1 to v4 & how Cloudera supports it

I kicked off the evening with a session on Apache Iceberg's specification evolution. We walked through how specs v1 and v2 addressed foundational table abstraction and row-level operations, then spent more time on v3 and the ongoing v4 proposals. We spent time discussing:

Binary Deletion Vectors: Moving beyond positional deletes to binary bitmaps. This is a game-changer for write-heavy workloads, significantly reducing the I/O overhead of row-level updates.
Row-Level Lineage: The introduction of stable row identifiers in v3, which finally makes true Change Data Capture (CDC) and incremental processing feel native to the table format.
The V4 Horizon: Touched on the active proposals for v4, including a single-file commit to simplify metadata writes and tackle metadata bloat, and using Parquet for metadata (replacing Avro) to allow for columnar metadata reads. This would let engines skip even more data by loading only the specific metadata fields they need for a query plan, among other things.

The focus was on understanding why each change was introduced and how it affects developers building lakehouse pipelines at scale. A lot of the discussion centered on practical implications: metadata growth, mutation patterns, execution behavior, and how these changes surface in real deployments. We also touched on how Cloudera has supported Iceberg's core capabilities from early on - and what it means to support spec evolution inside a production platform. We have navigated the transition from the early v1 spec, and now, as the community pushes into the v3/v4 frontier, our focus remains on making those powerful new capabilities - like deletion vectors and row-level lineage - available. The questions from the room showed that people are actively thinking about these new opportunities and about Iceberg's adoption in data platforms.

Cloudflare's Data Platform: Iceberg + DataFusion

Jonathan Chen from Cloudflare then introduced their new data platform built on R2, R2 Data Catalog, R2 SQL, and Pipelines. The architecture combines object storage, Apache Iceberg as the table layer, and Apache DataFusion as the query engine - enabling ingestion and SQL analytics directly over object storage. Jonathan explained how the R2 SQL engine (built on Apache DataFusion) uses a scatter-gather architecture to run analytics directly on Iceberg tables stored in R2, and how the engine can now handle aggregations (SUM, COUNT, etc.) and complex JOINs without the data ever leaving the Cloudflare network. It's a compelling look at a "no-infra" future where the storage itself is smart enough to answer your SQL queries.
Multimodal AI Lakehouse with Lance

Chang She (Co-founder & CEO of LanceDB) brought a different flavor to the conversation: the multimodal challenge. While Iceberg handles our analytical tables, how do we handle the billions of embeddings and blobs required for modern AI? The next wave of AI (think Midjourney, WorldLabs, and Runway) requires seamless, scalable access to much more than just numbers and strings. We're talking about text, images, embeddings, and complex modalities. Chang introduced Lance, a columnar data format optimized specifically for AI, and LanceDB, the multimodal lakehouse built on top of it. Another highlight was the new "branch/tag" capability. It essentially functions as "Git for AI data," allowing data scientists to create zero-copy clones of massive datasets for experimentation. This means you can branch a production dataset, run a fine-tuning job or a transformation experiment, and then either merge or discard it.

Final Thoughts

If you have been to a great meetup, you know the schedule is only half the story. After the talks, a lot of folks came up to continue the conversation - especially around what's coming next in Apache Iceberg and how to think about the upcoming spec direction. There were also good side discussions sparked by the Lance session, particularly around multimodal workloads and how teams are starting to think about vectors, text, and images alongside tabular lakehouse tables. This is exactly why we at Cloudera Community care about these meetups. The value isn't only the presentations - it's the direct, candid conversations that happen around them, across projects and across communities. It goes without saying that an event like this doesn't just happen. This meetup has been a long time coming, growing from early discussions between myself, Prashanth, ChanChan, and Jonathan about the need for a dedicated data infra meetup in the city. A massive thanks to the Cloudflare team for providing such a great space, and a huge shoutout to Cloudera, LanceDB, and Cloudflare for their support in making this a reality. Join us at Cloudera Community to keep track of all the meetups and events.
02-26-2026 01:17 PM
This blog is written in collaboration with Alex Merced - Head of DevRel @Dremio

There was a time when talking about "open table formats" would get you a polite nod... and then the conversation would move on. This was, of course, before "lakehouse" became a strategy slide in every vendor or enterprise deck, and before modular data architectures were a mainstream discussion. Back then, the idea that storage format layers should be open felt abstract and theoretical. Maybe even irrelevant in the greater scope of things! We were told, more than once:

"Why does this matter?"
"Isn't Parquet enough?"
"We already have a Data Warehouse & Data Lake"
"Just use Delta"

And to be fair, those weren't unreasonable questions at the time. Apache Iceberg wasn't a new query engine. It wasn't a new database or a shiny AI model. In a technical sense, it was basically metadata - and an open table format specification. Unfortunately, these things have never been in vogue - until users started seeing what that truly means!

The Problem Nobody Thought Was a Problem

One of the interesting things about the early Iceberg conversations is that most people didn't believe there was a structural issue in the first place. On one hand, cloud data warehouses were widely adopted and became the centralized repository for structured BI workloads; on the other hand, data lakes with Apache Parquet as the file format and Apache Hive as the table format became the standard for serving AI use cases. Databricks Delta Lake (the proprietary version of the Delta Lake table format) was also in use by customers already in the Databricks (or Azure Databricks) ecosystem. From the outside, things seemed fine. So when we began advocating for an open table format with a formal specification, and spoke about things like snapshot isolation, structured metadata trees, and partition evolution, the reaction was often confusion rather than resistance. Many people simply didn't see what we wanted them to see. And that was because the pain rarely showed up as "our table abstraction is flawed." It actually showed up in much more practical, day-to-day complaints. Things like:

Our Spark job is slow
Listing partitions on S3 takes forever
We can't change partitioning without rewriting everything
Schema evolution broke our downstream job
Two jobs wrote to the same table and now we have corrupted data

These were treated as performance problems, operational mistakes, or scaling challenges. Teams would add scripts. They would add locks. They would document rules explaining how to safely evolve schemas or manage partitions. Over time, those workarounds hardened into architectural debt. One of the main reasons for these issues was that a "table" was never a first-class abstraction in data lakes. It was an agreement layered on top of the file system. You had directories containing Parquet files, and a Hive table pointing to that directory. Table semantics were effectively determined by the engine that interpreted them. When datasets were small, listing files from the file system was tolerable, and when only one engine wrote to the table, behavioral assumptions held. But as volumes increased and workloads diversified, the cracks began to show. For engineers working with cloud data warehouses, these issues didn't matter: the storage layer had always been abstracted by vendors with proprietary file and table formats, and maintenance was handled by the warehouse's own background services.
The most common problem for these users was data getting locked into vendor systems, and the very high costs that came with it. Iceberg's proposition was not just faster queries or shinier features. It was about formalizing the table format as an independent layer, with an open specification for all compute engines to abide by. This open specification ultimately became the solution to the problems we discussed.

The Early Struggles

If the technical argument behind Iceberg was subtle, the real challenge was cultural. We were not introducing a solution that replaced something obvious. We were introducing a new idea that required people to rethink their storage layers and the impact they have. Unfortunately, that kind of shift does not happen through feature highlights or benchmarks alone - it requires changing mental models. During that time, most engineers did not wake up thinking, "I need an open table format." Their goal was to ship pipelines, optimize jobs, reduce storage costs, or stabilize production workloads. The storage layer was not top of mind. So when we started speaking about formal specifications, snapshot isolation, and metadata trees, we were effectively asking people to care about the invisible foundation beneath their existing systems - the historically abstracted layer we mentioned before. We realized fairly quickly that education was going to be crucial for engineers to understand the need for something like Iceberg and the open lakehouse. We had to explain what a table abstraction actually means in a data lake, why relying on proprietary storage systems brings long-term constraints, and why modularity and openness were critical. From the perspective of Iceberg's technical architecture, we had to explain how the metadata tree structure brings snapshot isolation, how hidden partitioning works, and why concurrency control models are imperative to running multiple workloads on the same table. Note that these were not surface-level topics. They required long-form explanations, breaking down complex ideas into diagrams, and showing how specific use cases can be implemented with Iceberg. At conferences, the reception was thoughtful but measured. The questions were sometimes skeptical, but often practical: How does this compare to Hive? How is this different from Delta? Are we introducing another layer? Oftentimes, half the effort was simply clarifying what Iceberg was and what it was not. But those questions were important, as they forced us to sharpen the articulation of the value proposition.

How We Evangelized Iceberg

In the early days of Apache Iceberg adoption, formal evangelism around the project was almost entirely driven by the engineers building it. Outside of committers and contributors, there was little dedicated effort focused on education, storytelling, or community-building around Iceberg as a standalone technology. That began to change when Alex Merced joined Dremio in December 2021, followed by Dipankar Mazumdar in February 2022, as the first Developer Advocates with a primary focus on Apache Iceberg. What followed was an organic and formative period of experimentation. There was no established playbook for how to "evangelize" a table format. Instead, advocacy meant learning in public, translating deep technical concepts into practical guidance, and showing up consistently wherever the data community gathered. Once it became clear that this would be a long-term effort, the approach became deliberate.
If the table format abstraction was unfamiliar, we had to make it understandable. If the ecosystem lacked vocabulary, we had to build it. So, we organized our efforts around a few core pillars.

Foundational Blogs & Hands-on Exercises

The first pillar was long-form technical writing, deliberately targeted at explaining Iceberg's core architectural concepts, how it fundamentally worked, and how it compared with other formats at the time. We wrote deep technical blogs explaining how Iceberg looked under the hood and how read and write queries worked. We unpacked how hidden partitioning helped avoid accidental full-table scans. We went over Iceberg's key features and explained why the Puffin file format was introduced to carry the additional statistics that could help with performance improvements. But we quickly realized that reading alone was not enough - engineers needed to see and run things. So alongside the blogs, we built hands-on exercises. The accompanying repository became a practical companion to the writing - a place where readers could experiment with table creation, schema evolution, and partition changes, and observe the internal behavior themselves.

Webinars/Podcasts

The second pillar was live education. We ran webinars and podcasts focused on the mechanics of Iceberg. These provided space for live demos, deeper dives, and real-time Q&A, often revealing where understanding of Iceberg was still unclear. Early sessions were as much learning experiences for presenters as for attendees. Much of the time was spent dissecting real questions about how the system behaves under scale, concurrency, and multi-engine access. Over time, these sessions became less about explanation and more about application. Engineers began arriving with their own workload patterns and architectural constraints, asking how Iceberg would behave in their system rather than in a generic example. That evolution led us to introduce dedicated office hours. Instead of one-to-many presentations, we created open forums where engineers could bring specific production scenarios, performance issues, or how-to questions. The goal was to reduce friction for real adopters and make the abstraction practical, not theoretical. Those office hours became one of the most important feedback loops. They exposed edge cases, clarified documentation gaps, and often influenced how we explained Iceberg going forward.

Conferences & Ecosystem Conversations

Conferences played quite a different role than blogs or webinars. We were showing up in places where Iceberg (or table formats in general) was not yet the dominant topic. Our talks often required contextual framing before diving into technical details. We had to explain the problem space before explaining the solution. That meant sessions focused as much on open lakehouse architecture as on Iceberg itself. But conferences were not just educational moments - they were ecosystem checkpoints. These events brought together practitioners from different organizations who were solving similar problems in parallel. So, naturally, the conversations moved beyond "How does this feature work?" toward "How are you running this in production?" and "What are you doing for compaction at scale?" We were onto the next level: instead of debating whether the table abstraction was needed, engineers were comparing operational strategies. Conferences became a place where ecosystem alignment formed in public.
Different vendors, contributors, and adopters were discussing the same specification, the same semantics, and the same trade-offs. It reinforced the idea that Iceberg was not tied to a single company's roadmap but was evolving as an independent specification.

Books and Research Paper

By this point, we had already produced a significant body of written material - deep technical blogs, architectural breakdowns, hands-on guides. All of this was great to see, but we wanted to consolidate it into something more formal and durable. One outcome of that consolidation was publishing formal research work, including the paper "The Data Lakehouse: Data Warehousing and More". It was a way to capture the "why" behind the open lakehouse paradigm and compare it to traditional database systems like data warehouses. In parallel, there was also a clear need for something even more practitioner-oriented and complete: a definitive reference that engineers could keep on their desk. That's where the work on Apache Iceberg: The Definitive Guide came in. A book is quite different from everything else. It forces you to organize the subject end-to-end: table structure, metadata layers, table operations, performance patterns, and production practices.

Community Building & Public Collaboration

We want to stress that while the blogs, talks, and other educational materials helped explain Iceberg to the masses, the real inflection came from the diverse community that began forming around it. Iceberg was never positioned as a vendor-owned format, and its specification evolved in public. Committers from different organizations made proposals in the open. Integrations across engines and systems were built by contributors with different priorities and production realities. The mailing lists, Slack channels, and conference hallways became places where real-world lessons were exchanged. And our role in that process was to amplify these narratives and connect them. By consistently highlighting community contributions, helping users on the Iceberg Slack, inviting committers and adopters to speak at webinars and conferences, and creating spaces like office hours for open discussion, we tried to make participation visible and accessible. But the exchange was never one-directional. Those conversations sharpened our own understanding. For example, the office hours surfaced problems we hadn't considered, Slack discussions revealed documentation gaps, and talks by external adopters brought real production constraints into the spotlight. The more the community engaged, the more grounded and precise our messaging became.

Closing Reflections

Looking back, the early days of Apache Iceberg evangelism were defined by experimentation, curiosity, and a strong belief in open standards. Without a clear roadmap, advocacy evolved through blogs, webinars, events, books, and countless conversations with a data community trying to make sense of a rapidly changing data landscape. What began as a small effort to explain a new table format grew into sustained engagement with a global lakehouse community. If you are interested in continuing to learn more about Apache Iceberg from Dipankar and Alex, follow these channels:

Blogs - Cloudera Community
Cloudera Developers Playlist
The Dremio Blog

Below this post, you'll find a list of published works that Alex and Dipankar have been part of over the years. We encourage you to explore these writings, talks, and recordings to see how the ideas around Apache Iceberg and the lakehouse have evolved through that work.
Apache Iceberg: The Definitive Guide
Engineering Data Lakehouse with Open Table Formats
The Data Lakehouse Paper: Data Warehousing and More
Apache Polaris: The Definitive Guide
Architecting an Apache Iceberg Lakehouse
The Apache Iceberg Digest Vol. 1
The Book on Using Apache Iceberg with Python
The Book on Agentic Analytics
awesome-lakehouse-guide GitHub Repository
01-27-2026 04:58 PM
Open Lakehouse architecture brings the modularity and flexibility needed to run multiple analytical workloads on top of a single source of data, powered by open table formats like Apache Iceberg. Rather than relying on a tightly coupled, monolithic system, it allows teams to compose their data platform from independent building blocks - storage, table formats, catalogs, and compute engines - each selected based on specific workload and operational requirements. Storage is one of the key components of a lakehouse architecture. It is the foundation on which table formats implement transactional semantics, ACID guarantees, and metadata management, and enable multi-engine interoperability. Decisions made at the storage layer directly influence how reliably tables can evolve, scale, and be shared across systems.

Apache Iceberg: Database semantics on Object storage

Apache Iceberg implements ACID properties on top of object storage and provides a schema to refer to the data files (e.g. Apache Parquet) as a "table." At its core, Iceberg separates physical data storage (immutable Parquet/ORC/Avro files) from logical table state (schemas, partitions, snapshots, file-level statistics). This separation is fundamental to how Iceberg behaves and enables multiple compute engines to work together on the same table at the same time. Instead of relying on directory layouts or file naming conventions, Iceberg maintains explicit metadata files that describe:

Which data files belong to the table
How they are partitioned
Which snapshot represents the current table state
How the table evolved over time

As tables evolve, both metadata and data files grow in number, making the behavior and scalability of the underlying object storage a first-class concern for Iceberg deployments.
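Engines expose exactly this information as queryable "metadata tables." As a minimal sketch - assuming a SparkSession (spark) with the Iceberg runtime and a catalog named demo is already configured, and using a hypothetical table name demo.nyc.taxis - the metadata listed above can be inspected with plain SQL:

# Hedged sketch: Iceberg metadata tables queried through Spark.
# "demo" (catalog) and "nyc.taxis" (table) are placeholder names.
spark.sql("SELECT snapshot_id, operation, summary FROM demo.nyc.taxis.snapshots").show()   # how the table evolved over time
spark.sql("SELECT file_path, partition, record_count FROM demo.nyc.taxis.files").show()    # data files in the current snapshot and their partitions
spark.sql("SELECT path, added_data_files_count FROM demo.nyc.taxis.manifests").show()      # manifests referenced by the current snapshot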
Apache Ozone: Object Store for Lakehouse

Apache Ozone is designed with specific requirements in mind. It is a highly scalable, distributed object store built to support growing data volumes while also handling large numbers of smaller objects without exhausting metadata capacity. This makes it well suited for lakehouse systems where table formats continuously generate new data and metadata artifacts as tables evolve.

What does Apache Ozone bring to the "Table"?

Before diving into the hands-on workflow, it's worth briefly summarizing what Apache Ozone brings as the storage system.

Open-source, cloud-native object and file storage
Apache Ozone is an open-source, cloud-native storage system designed to scale to billions of objects and hundreds of petabytes. Its architecture is built for distributed deployments, making it suitable for large analytical platforms where storage growth is continuous and long-lived.

Dual-access semantics: object and filesystem APIs
Ozone supports native S3-compatible access for modern data platforms while also exposing traditional filesystem semantics (OFS). This dual-access model allows the same underlying data to be reached through either interface, enabling gradual migration and mixed workloads within a single deployment.

Strong consistency without a centralized metadata bottleneck
Ozone provides strong consistency guarantees while avoiding the traditional NameNode bottleneck by fully decoupling metadata management from storage. The metadata plane (Ozone Manager) and the storage plane (Storage Container Manager) operate independently and are coordinated using Apache Ratis, a Raft-based consensus protocol. This design enables scalable metadata operations without sacrificing correctness.

Proven at petabyte scale in production
Ozone has been validated in large-scale production environments, supporting petabyte-scale datasets and high object counts. This makes it a practical storage foundation for highly transactional systems.

These characteristics make Apache Ozone a strong fit as a general-purpose object store for lakehouse architectures. When paired with an open table format like Apache Iceberg, these same properties directly address the storage challenges that emerge as tables grow, evolve, and accumulate both data and metadata over time. Ozone addresses several core challenges that are common in Iceberg-based lakehouse deployments.

Scalable storage for data growth: Analytical datasets grow both in data size and in object count as tables are ingested, partitioned, rewritten, and optimized over time. Ozone is designed to scale both dimensions independently, distributing data across storage nodes while maintaining a consistent object namespace.
Efficient handling of small objects: Beyond raw data growth, Iceberg tables generate a large amount of metadata as part of normal table writes and evolution. Ozone is built to handle large volumes of small objects without saturating metadata capacity, which is critical in lakehouse systems where metadata growth is intrinsic.
Built-in durability, security, and availability: Ozone provides enterprise-grade storage features required in production lakehouse environments, including data encryption for security, erasure coding for storage efficiency, and replication for fault tolerance. These capabilities allow Ozone to serve as a durable system of record for both Iceberg data and metadata over long table lifecycles.
S3-compatible access: Ozone exposes a native Amazon S3-compatible API, allowing Iceberg and multiple compute engines to interact with tables using standard object storage interfaces. As a result, Iceberg tables stored in Ozone follow the same layout and semantics as they would on cloud object storage.

Exploring Apache Iceberg with Apache Ozone

With the architectural context in place, let's go through a hands-on exploration of Apache Iceberg tables with Apache Ozone as the storage system. Our goal here is not to understand the Iceberg APIs, but rather to see what the Ozone + Iceberg combination brings. All examples in this section are driven by a preconfigured Jupyter notebook available in the GitHub repository. You can run the notebook end-to-end to perform a sequence of table operations (create, write, evolve, update, delete). Rather than walking through each notebook cell, the sections below highlight what to observe in Apache Ozone as those operations execute. The notebook is preconfigured to use:

Apache Iceberg as the table format
An Iceberg REST catalog backed by object storage
Apache Ozone (via its S3-compatible interface) as the storage layer for both data and metadata
Apache Spark as the compute engine

Iceberg table layout materialized in Apache Ozone

Once you write to an Iceberg table, the first thing to observe is the creation of Iceberg's /metadata and /data directories in Ozone's file system (as seen below). You can inspect this layout directly using any S3-compatible client against Ozone's S3 Gateway:

aws s3api --endpoint-url http://s3.ozone:9878 \
  list-objects \
  --bucket warehouse \
  --prefix nyc/taxis/metadata/

aws s3api --endpoint-url http://s3.ozone:9878 \
  list-objects \
  --bucket warehouse \
  --prefix nyc/taxis/data/

As you progress through the notebook - writing data, evolving schemas, and performing updates or deletes - you will notice that the number of objects under metadata/ continues to grow. From a storage perspective, this means that metadata growth is intrinsic. Even modest tables can accumulate a large number of small metadata objects over time, so it is worth keeping an eye on.
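One simple way to track that growth is to count and size the objects under the metadata/ prefix after each notebook step. A minimal sketch, assuming the same Ozone S3 Gateway endpoint, bucket, and prefix used above, and that S3 credentials for the gateway are available in the environment:

import boto3

# Point boto3 at Ozone's S3 Gateway (endpoint, bucket, and prefix mirror the
# notebook layout above; credentials are assumed to come from the environment).
s3 = boto3.client("s3", endpoint_url="http://s3.ozone:9878")

paginator = s3.get_paginator("list_objects_v2")
count = 0
total_bytes = 0
for page in paginator.paginate(Bucket="warehouse", Prefix="nyc/taxis/metadata/"):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

print(f"metadata objects: {count}, total size: {total_bytes} bytes")

Re-running this after each table operation makes the incremental metadata growth described above easy to see.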
Observing Iceberg activity through Apache Ozone management UIs

One advantage of using Iceberg with Apache Ozone is the ability to observe storage behavior directly through Ozone's management interfaces. Ozone internally separates metadata management, physical storage placement, and observability into distinct services. Understanding these components helps explain why Ozone is a strong storage foundation for open table formats like Iceberg.

Ozone Recon: cluster and object-level visibility

Ozone Recon is an observability and insight service that provides a consolidated view of cluster state, usage, and health. Recon is not involved in the read or write path, but it plays an important role in operating and debugging Ozone-backed lakehouse deployments. Recon provides:

Cluster-level metrics and capacity insights
Object (key) counts and namespace usage
Container and pipeline health information
Diagnostic views useful for troubleshooting storage-related issues

When running Iceberg workloads, Recon allows operators to correlate table activity with storage behavior. As Iceberg generates new data and metadata objects over time, Recon makes it possible to observe how object counts, container usage, and cluster health evolve. In the Overview view, we can observe how the storage layer behaves as Iceberg operations are executed. In this run, the cluster remains healthy with 1 active datanode and 2 healthy containers. At this point, Recon reports 5 keys in the namespace, reflecting the data and metadata objects created by the Iceberg table. Despite ongoing object creation, container health and pipeline status remain stable, indicating that the workload does not introduce stress or instability at the storage layer. From a capacity perspective, the cluster shows ~40.5 GB used out of 452.1 GB available (roughly 9-10% utilization), with Iceberg-related data accounting for a small but growing portion of overall usage. This highlights an important aspect of Iceberg workloads: storage growth happens incrementally and continuously as tables evolve. Recon's Insights view adds another layer of visibility into this behavior. The file size distribution reveals a mix of object sizes produced by the Iceberg workload, with multiple objects in the 8 KiB-16 KiB range alongside smaller objects in the 2 KiB-4 KiB range. This pattern reflects Iceberg's operational model, where relatively small metadata and manifest files are created alongside larger Parquet data files.

Ozone Manager (OM): metadata plane health

The Ozone Manager is the master service responsible for managing Ozone's object namespace and metadata. This includes volumes, buckets, keys, and the metadata required to map objects to their underlying storage blocks. OM's responsibilities are strictly at the metadata and namespace level:

Tracking object metadata (keys) created via the S3-compatible interface
Maintaining a consistent view of volumes and buckets
Coordinating metadata updates using Apache Ratis (Raft) to provide high availability and consistency

Since Iceberg implicitly relies on the storage system to durably and consistently persist object metadata as new files are written during table evolution, OM ensures that object metadata updates are replicated and consistently visible across the cluster.

Conclusion

This blog highlights what Apache Ozone brings to an Iceberg-based lakehouse without requiring any special integration. As Iceberg operations execute, Ozone consistently acts as a durable system of record for both data and metadata objects. It absorbs continuous object creation driven by table evolution, scales object count independently of data size, and maintains a stable namespace as tables grow and change over time. Equally important is visibility. Through Ozone Manager and Recon, object growth, namespace health, container placement, and cluster state can be inspected and correlated directly with table activity. This makes it easier to reason about, operate, and debug metadata-intensive lakehouse workloads. In practice, what Apache Ozone brings is a storage foundation that aligns with the demands of modern table formats: scalable object storage, consistent metadata management, and first-class observability. When those properties are present, formats like Apache Iceberg can focus entirely on table-level semantics - while Apache Ozone reliably handles everything underneath.
12-04-2025 09:32 AM
Apache Iceberg began as an internal project at Netflix to solve a very real problem: analytical data lakes built on file formats like Apache Parquet and ORC lacked reliable schema evolution, atomicity, and visibility into table state. To address these challenges, the team defined a specification that could enforce consistency and transactional guarantees at scale - introducing Iceberg as the "table format for slow-moving data." If we take a step back, most early "big data" workloads relied on the Apache Hive table format stored in the Hadoop Distributed File System (HDFS). Over time, the Hive table format began to show its limitations - especially at Netflix's scale and as workloads moved to cloud object stores like Amazon S3. One of the key design challenges was that Hive relied on directory structures to define tables, where each partition mapped to a folder and its files represented the data. This design worked well on HDFS, where directory listings were fast and consistent. But on S3, listing millions of files across nested partitions became slow and costly. Also, since S3 stores data as independent objects in a flat structure, it can throttle requests when too many target the same prefix. In a typical Hive table layout, where partitions share a common prefix, concurrent listings or writes can easily hit S3 request-rate limits, causing throttling and 5xx errors. And there were other limitations - concurrency, corrupted table states, stale statistics, and so on - that needed a fresh focus. These motivated the development of Iceberg with some clear design goals:

Ensure table correctness and consistency, even with concurrent writes.
Enable faster query planning and execution without full directory scans.
Allow users to ignore the physical layout of files and focus on logical data.
Support schema and table evolution safely.
Accomplish all of this at cloud scale.

Iceberg's answer was to redefine what a "table" means on a data lake. Instead of treating all files under a directory as the table, Iceberg introduced the concept of a canonical list of data files, tracked through metadata, manifests, and snapshots. Each commit represents a complete, immutable view of the table at a point in time, enabling atomic operations, time travel, and isolation without relying on brittle directory structures. This design not only addressed the core limitations of Hive-style tables but also laid the foundation for the open lakehouse architecture. By separating the logical definition of a table from its physical layout, Iceberg made it possible for multiple compute engines to operate on the same dataset with full transactional guarantees. Iceberg's open, versioned table specification ensures that these capabilities are not tied to any single engine or vendor, allowing consistent behavior and interoperability across diverse systems. From there, Iceberg evolved through successive specifications (v1, v2, v3, and the upcoming v4), each extending its capabilities to support new data types, row-level operations, and other innovations that developers now rely on in production-grade open lakehouses.

Evolution of the Apache Iceberg Table Format Specification

With that bit of history, let's look at how Iceberg's specification has evolved over time and what engineers and developers can look forward to in the upcoming one.

Spec v1: How Iceberg Enabled Analytic Tables at Scale

The first specification, v1, defined how to manage large analytical tables on immutable file formats like Parquet, Avro, and ORC.
Its core goal was to bring database-like guarantees (ACID) and schema evolution to the data lake, while staying agnostic to the underlying compute engine. v1 introduced a multi-layered metadata architecture:

Data files stored the actual table content.
Manifest files listed groups of data files and their statistics (min/max values, record counts, partition information).
A manifest list tracked all manifests that together defined the current table snapshot.
A metadata file pointed to the latest snapshot and table-level properties such as schema, partition spec, and snapshot lineage.

This hierarchy made every commit an immutable snapshot, so readers could rely on a consistent view while writers performed atomic updates by swapping in a new metadata file reference. It solved a critical problem that Hive couldn't: snapshot isolation on object storage. Each table version could be reconstructed precisely using metadata alone, enabling time travel and rollback operations without duplicating data. Spec v1 also formalized schema evolution. Fields could be added, renamed, or deleted safely without rewriting existing data files, since schema and column IDs were tracked independently from file structure. This was a foundational change - decoupling physical data layout from logical schema made Iceberg a durable layer over immutable files. Through releases up to v0.11.0, this specification became the basis for large-scale analytical workloads.

Spec v2: How Iceberg Supports Row-Level Deletes and Incremental Mutations

By the time Iceberg reached v0.11.1, the community had voted to approve Spec v2 - a major step forward. While v1 worked well for immutable, append-only workloads, real-world production systems needed efficient row-level deletes and updates to handle CDC (Change Data Capture), GDPR deletions, and late-arriving data corrections. Spec v2 introduced delete files, which encode rows to be deleted or replaced within existing data files. This design enabled a new table-write pattern known as Merge-on-Read (MoR). Instead of rewriting entire data files, as done in the Copy-on-Write (CoW) pattern, MoR tables simply append new delete files and let readers reconcile them at query time. The merge process produces a consistent view by applying deletes over the underlying data files dynamically. Two types of delete files implement this behavior:

Position deletes - mark rows for removal by their physical position in a data file
Equality deletes - mark rows based on column values (for example, delete all rows where id = 123)

This mechanism drastically reduced write amplification, making row-level updates practical on immutable storage - a foundational capability for streaming ingestion and near-real-time pipelines. Another subtle but important change in v2 was stricter writer guarantees. While atomic commits and snapshot isolation, which form the foundation of Iceberg's optimistic concurrency control (OCC) model, had existed since v1, Spec v2 reinforced this model by formalizing writer-side validation semantics. Writers were now required to validate their parent snapshot lineage during commit, strengthening the OCC-based transaction model and ensuring consistent behavior for concurrent row-level operations across engines. By v1.4.0, Iceberg adopted format-version = 2 as the default for new tables.
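As a concrete illustration, here is a minimal Spark SQL sketch of a v2-style table and a row-level delete. The catalog and table names (demo.db.events) are hypothetical, a SparkSession with the Iceberg Spark runtime, SQL extensions, and a catalog named demo is assumed to be configured already, and the write-mode properties shown are one way to opt a table into merge-on-read behavior:

from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime and SQL extensions are configured and a
# catalog named "demo" exists; all names below are illustrative only.
spark = SparkSession.builder.appName("iceberg-v2-sketch").getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (
    id BIGINT,
    payload STRING,
    ts TIMESTAMP
  )
  USING iceberg
  TBLPROPERTIES (
    'format-version' = '2',
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
  )
""")

# A row-level delete like this can be served by writing small delete files
# instead of rewriting the affected data files.
spark.sql("DELETE FROM demo.db.events WHERE id = 123")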
At this point, Iceberg had evolved from a read-optimized analytical format into a mutable table layer capable of low-latency upserts, streaming merges, and transactional safety across distributed writers.

Spec v3: Extended Types, Metadata Enhancements & Row Lineage

As workloads diversified, the community began tackling a broader set of problems - handling semi-structured and geospatial data, adding lineage tracking, and improving deletion efficiency. These efforts culminated in Spec v3, reflected in Iceberg releases v1.8.0 to v1.10.0 (2025). The first wave of v3 features appeared in v1.8.0 (February 2025), introducing capabilities like:

Binary Deletion Vectors (DVs): Store row-level delete information as compact binary bitmaps, removing the need for separate delete files and enabling faster merges.
Variant Type: A flexible column type for semi-structured, JSON-like data, allowing ingestion of untyped data without strict schema enforcement.
New Geospatial Types: Support for geometry and geography data to power location analytics and mapping use cases.
Nanosecond Precision Timestamps (with/without TZ): For event and telemetry workloads demanding precise temporal resolution.

Later, v1.9.0 (April 2025) and v1.10.0 (September 2025) expanded v3's depth:

Row Lineage Tracking: Metadata fields that allow engines to detect row-level changes between commits, which simplifies incremental processing.
Default Values & Multi-Argument Transforms: Allow specifying default column values for schema evolution and defining partitioning or sorting transforms that use multiple input columns.
Table Encryption Keys: Built-in encryption primitives to secure data at rest, complementing external key management systems.
Spec Clarifications: Write requirements to prevent orphaned deletion vectors and ensure consistent reader behavior across engines.

Why Do These Innovations Matter for Developers?

The v3 spec opened new possibilities for how developers build and operate data pipelines. Many of the additions in this spec directly address pain points that emerge once systems reach scale and workloads increase. Let's look at a few real-world use cases where these new innovations add value.

Incremental & CDC Processing

Until recently, building incremental pipelines over immutable data was complicated and costly. Iceberg v3's row lineage capabilities, through internal _row_id and _sequence_number metadata, give each row a persistent identity across commits. Engines can now detect exactly which rows changed between snapshots. For developers, this makes CDC ingestion, materialized view refreshes, and downstream incremental transformations far simpler. Instead of rescanning entire partitions or maintaining complex diff logic, pipelines can consume only what changed, enabling faster, cheaper, and more reliable refresh cycles.
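To give a rough feel for what "consume only what changed" looks like from an engine today, here is a hedged sketch using Iceberg's incremental append scan in Spark between two snapshots (the table name and snapshot IDs are placeholders, the SparkSession from the earlier sketch is assumed, and richer change queries built on v3 row lineage depend on engine support):

# Hypothetical snapshot IDs taken from the table's history.
start_snapshot_id = "1111111111111111111"
end_snapshot_id = "2222222222222222222"

# Read only the rows appended between the two snapshots of demo.db.events.
changes = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", start_snapshot_id)
    .option("end-snapshot-id", end_snapshot_id)
    .load("demo.db.events")
)
changes.show()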
Efficient Row-Level Deletes and Read Performance

Spec v2 enabled row-level deletes through external delete files, which reduced write amplification but created operational overhead. Many engines didn't compact these files automatically, and in production tables they often accumulated over time, forcing readers to merge dozens of delete files during scans. Spec v3 introduces binary deletion vectors (DVs), allowing developers to handle row-level deletes efficiently without relying on manual compaction jobs. DVs store deleted-row markers as compact bitmaps linked to data files, simplifying maintenance, reducing read-time merging, and improving performance for mutation-heavy workloads such as CDC ingestion or streaming upserts.

Dealing with Diverse & Evolving Data

Modern pipelines increasingly mix structured, semi-structured, and geospatial data. Before Iceberg v3, developers had to flatten JSON into many nullable columns or store it as strings, which is inefficient for filtering and query pushdown. Spec v3 introduces a VARIANT type that encodes semi-structured data natively, allowing engines to query nested fields without full scans or heavy parsing. It also adds GEOMETRY and GEOGRAPHY types for spatial data, enabling efficient spatial joins and region-based filtering directly on coordinates or shapes. Together with default column values, these extensions make schema evolution safer and allow diverse data types to be stored and queried efficiently within the same table.

Optimizing Partitioning and Query Planning

Earlier specs of Iceberg restricted partition transforms to a single column. For example, a bucket(16, user_id) transform could hash only one input field. In some cases, though, developers needed composite bucketing, like distributing data by a combination of columns (country, city) to improve data locality and query pruning. Spec v3 extends the partition specification to support multi-argument transforms, particularly for the bucket transform. This allows developers to define composite partition keys natively instead of encoding them into a single field. By enabling hashing and partitioning across multiple columns, query engines can plan scans more precisely and reduce shuffle during joins or aggregations. The result is better data clustering and lower compute overhead for large analytical workloads.

Building Secure and Governed Tables

Finally, Spec v3 introduces table-level encryption keys, giving organizations a consistent way to secure Iceberg tables at rest and manage key rotation policies. This is particularly relevant in multi-tenant or regulated environments where governance and compliance are critical.

Spec v4: What's Next for Developers Building with Iceberg

Spec v4 is still in planning and proposals are being made, but the direction is clear. A lot of the work is focused on tightening the format around real pain points developers hit once tables get big, commits get frequent, and metadata becomes the bottleneck. Let's take a look at four of these proposals that have momentum in the Iceberg community.

Single-File Commit

Every Iceberg commit today involves writing multiple metadata layers - a new metadata.json, a manifest list, and one or more manifest files. This structure introduces unnecessary overhead for small or frequent writes. Even a small update requires multiple metadata rewrites, delete operations often trigger full manifest rewrites (with CoW), and caching manifests across commits becomes difficult since files are frequently replaced. Spec v4 proposes a simpler model built around a Root Manifest, which replaces the manifest list and acts as the single entry point for each snapshot. The hierarchy collapses into a clean two-level structure:

Root Manifest -> Data Manifests / Delete Manifests / Files

Each commit now modifies only what changed, keeping metadata growth proportional to the size of the operation rather than the size of the table. The benefits are immediate for developers: faster commits and fewer metadata rewrites.
Query planning also improves, as the Root Manifest can aggregate file-level metrics from its children, allowing pruning to happen earlier. Together, these changes make Iceberg better suited for streaming and micro-batch workloads where commits are small but frequent.

Storing Metadata in Parquet

Since its early versions, Iceberg has stored metadata files in Apache Avro. That choice worked well when manifests were small and queries read them as whole records. But as tables have grown, with hundreds of columns and thousands of file-level statistics, reading entire manifest rows has become expensive. Query engines often need just a subset of fields (for example, file paths and a single column's min/max), but Avro forces them to deserialize the entire record. The community is now proposing to transition metadata files to a columnar format using Apache Parquet. This change allows engines to read only the necessary columns, improving planning efficiency and memory usage. It also aligns metadata storage with data storage, unlocking optimizations like column pruning and predicate pushdown even for metadata queries. In combination with the new single-file commit model, this ensures that query planning remains fast even as metadata becomes richer and more expressive.

Column Statistics Rework: Making Stats First-Class

Another proposal focuses on redesigning how column statistics are represented. Currently, stats for each column - such as lower and upper bounds, null counts, and value sizes - are stored as a generic map from field IDs to values. While functional, this approach creates several problems: it's inefficient for wide tables, loses type information during serialization, and makes it hard to project only specific stats. The new proposal introduces a typed, structured representation of column stats. Each field's statistics will be stored with preserved logical and physical types, making them more reliable through schema evolution. Engines will be able to read individual stats (for example, just the lower bounds for a few columns) without loading everything into memory. This change also makes statistics extensible: developers will be able to attach richer per-field metrics for emerging data types like VARIANT or GEOMETRY, and query engines can use them for smarter pruning. In practice, this means more predictable performance on wide, evolving schemas, and better planning efficiency for mixed workloads.

Relative Paths

This addresses a long-standing operational issue: Iceberg stores all file paths as absolute URIs. This becomes challenging when you need to move a table - between buckets, regions, or even storage systems - because each embedded path must be rewritten. For replication, disaster recovery, or multi-region deployments, this has been cumbersome. Spec v4 proposes support for relative paths within table metadata. By storing references relative to the table root, Iceberg tables can be moved or copied without rewriting metadata. The internal relationships between data and metadata files remain consistent, and absolute paths can still be used where needed for external data. This makes replication, backup, and migration simpler.

Community Drives Innovation

The pace of Iceberg's evolution is largely a reflection of its community. Every new capability has emerged from real operational challenges faced by developers across organizations. Users and developers from companies like Netflix, Apple, Dremio, Tabular, Snowflake, Cloudera, and many others contribute code, specifications, and design reviews.
Each change is first proposed as a public discussion or design document, iterated on by the community, and only then voted into the specification. This process ensures that new features address genuine production problems, rather than being driven by any single engine or vendor. It also creates a rapid feedback loop - as developers deploy Iceberg at scale, they bring real-world lessons back into design discussions. That's why the features in recent specs - e.g. deletion vectors, root-level manifests, and columnar metadata - can be traced directly to patterns seen in high-throughput streaming and batch environments. Iceberg has matured from solving basic data management challenges on the data lake to defining how transactional workloads operate in open data environments. Each specification has extended the boundaries of what a table format can do - adding the kind of features developers expect from databases and warehouses into the lakehouse architecture, while offering openness and interoperability. The result is a format that now underpins a wide spectrum of workloads: batch analytics, incremental processing, streaming ingestion, and AI pipelines.

Cloudera's Commitment to the Iceberg Journey

Cloudera introduced native support for Apache Iceberg in its public cloud Lakehouse platform in 2021, extending it to on-premises deployments the following year. Since then, Iceberg has become central to how customers build modern data architectures across hybrid and multi-cloud environments. Today, petabytes of data are managed in Iceberg tables on Cloudera, powering everything from near-real-time analytics and regulatory compliance workloads to AI data preparation and large-scale data engineering pipelines. Alongside this, the newly launched Cloudera Lakehouse Optimizer automates table maintenance operations that would otherwise require manual tuning. It continuously manages small-file compaction, manifest rewriting, and layout optimization, improving query performance and reducing storage costs. For engineers, this means less operational overhead - no babysitting tables, no manual compaction or cleanup - while maintaining the same consistency guarantees across all Iceberg-compatible engines. By aligning open table formats with enterprise-grade governance, hybrid deployment flexibility, and automated optimization, Cloudera's platform aims to make it simpler for developers and data teams to adopt Iceberg confidently across environments. Cloudera also contributes to the Apache Iceberg community through ongoing code contributions, community initiatives such as meetups, and developer-focused educational resources. Join our Community!