03-02-2026 01:17 PM
A few days back we co-hosted a Lakehouse meetup in New York City with Cloudflare and LanceDB, bringing together folks from the Apache Iceberg, Apache DataFusion, and Lance communities. At Cloudera, we view supporting and running open source meetups as a critical driver for the ecosystem itself, and we have done this for a while now. Open source projects like Iceberg, DataFusion, and Lance evolve through real-world feedback, and those conversations happen best when practitioners are in the same room. These meetups create that space, where design decisions are debated, trade-offs are weighed, and implementation realities are discussed openly. This event was a good example of that! We had three dedicated talks targeting these three projects and opened up the rest of the time for community networking.

Iceberg Spec Evolution - v1 to v4 & how Cloudera supports it

I kicked off the evening with a session on Apache Iceberg's specification evolution. We walked through how specs v1 and v2 addressed foundational table abstraction and row-level operations, then spent more time on v3 and the ongoing v4 proposals. We spent time discussing:

Binary Deletion Vectors: Moving beyond positional deletes to binary bitmaps. This is a game-changer for write-heavy workloads, significantly reducing the I/O overhead of row-level updates.
Row-Level Lineage: The introduction of stable row identifiers in v3, which finally makes true Change Data Capture (CDC) and incremental processing feel native to the table format.
The V4 Horizon: Touched on the active proposals for v4, including a single-file commit to simplify metadata writes and tackle metadata bloat, and using Parquet for metadata (replacing Avro) to allow for columnar metadata reads. This would let engines skip even more data by loading only the specific metadata fields they need for a query plan, among other things.

The focus was on understanding why each change was introduced and how it affects developers building lakehouse pipelines at scale. A lot of the discussion centered on practical implications: metadata growth, mutation patterns, execution behavior, and how these changes surface in real deployments. We also touched on how Cloudera has supported Iceberg's core capabilities from early on - and what it means to support spec evolution inside a production platform. We have navigated the transition from the early v1 spec, and now, as the community pushes into the v3/v4 frontier, our focus remains on making those powerful new capabilities - like deletion vectors and row-level lineage - available. The questions from the room showed that people are actively thinking about these new opportunities and about Iceberg's adoption in data platforms.

Cloudflare's Data Platform: Iceberg + DataFusion

Jonathan Chen from Cloudflare then introduced their new data platform built on R2, R2 Data Catalog, R2 SQL, and Pipelines. The architecture combines object storage, Apache Iceberg as the table layer, and Apache DataFusion as the query engine - enabling ingestion and SQL analytics directly over object storage. Jonathan explained how the R2 SQL engine (built on Apache DataFusion) uses a scatter-gather architecture to run analytics directly on Iceberg tables stored in R2, and how the engine can now handle aggregations (SUM, COUNT, etc.) and complex JOINs without the data ever leaving the Cloudflare network. It's a compelling look at a "no-infra" future where the storage itself is smart enough to answer your SQL queries.
Multimodal AI Lakehouse with Lance

Chang She (Co-founder & CEO of LanceDB) brought a different flavor to the conversation: the multimodal challenge. While Iceberg handles our analytical tables, how do we handle the billions of embeddings and blobs required for modern AI? The next wave of AI (think Midjourney, WorldLabs, and Runway) requires seamless, scalable access to much more than just numbers and strings. We're talking about text, images, embeddings, and complex modalities. Chang introduced Lance, a columnar data format optimized specifically for AI, and LanceDB, the multimodal lakehouse built on top of it. Another highlight was the new "branch/tag" capability. It essentially functions as "Git for AI data," allowing data scientists to create zero-copy clones of massive datasets for experimentation. This means you can branch a production dataset, run a fine-tuning job or a transformation experiment, and then either merge or discard it.

Final Thoughts

If you have been to a great meetup, you know the schedule is only half the story. After the talks, a lot of folks came up to continue the conversation - especially around what's coming next in Apache Iceberg and how to think about the upcoming spec direction. There were also good side discussions sparked by the Lance session, particularly around multimodal workloads and how teams are starting to think about vectors, text, and images alongside tabular lakehouse tables. This is exactly why we at Cloudera Community care about these meetups. The value isn't only the presentations - it's the direct, candid conversations that happen around them, across projects and across communities. It goes without saying that an event like this doesn't just happen. This meetup has been a long time coming, growing from early discussions between myself, Prashanth, ChanChan, and Jonathan about the need for a dedicated data infra meetup in the city. A massive thanks to the Cloudflare team for providing such a great space, and a huge shoutout to Cloudera, LanceDB, and Cloudflare for their support in making this a reality. Join us at Cloudera Community to keep track of all the meetups and events.
02-26-2026 01:17 PM
This blog is written in collaboration with Alex Merced - Head of DevRel @Dremio

There was a time when talking about "open table formats" would get you a polite nod... and then the conversation would move on. This was, of course, before "lakehouse" became a strategy slide in every vendor or enterprise deck, and before modular data architectures were a mainstream discussion. Back then, the idea that storage format layers should be open felt abstract and theoretical. Maybe even irrelevant in the greater scope of things! We were told, more than once:

"Why does this matter?"
"Isn't Parquet enough?"
"We already have a Data Warehouse & Data Lake"
"Just use Delta"

And to be fair, those weren't unreasonable questions at the time. Apache Iceberg wasn't a new query engine. It wasn't a new database or a shiny AI model. In a technical sense, it was basically metadata - and an open table format specification. Unfortunately, these things have never been in vogue - until users started seeing what that truly means!

The Problem Nobody Thought Was a Problem

One of the interesting things about the early Iceberg conversations is that most people didn't believe there was a structural issue in the first place. On one hand, cloud data warehouses were widely adopted and became the centralized repository for structured BI workloads; on the other hand, data lakes with Apache Parquet as the file format and Apache Hive as the table format became the standard for serving AI use cases. Databricks Delta Lake (the proprietary version of the Delta Lake table format) was also in use by customers already in the Databricks (or Azure Databricks) ecosystem. From the outside, things seemed fine. So when we began advocating for an open table format with a formal specification, and spoke about things like snapshot isolation, structured metadata trees, and partition evolution, the reaction was often confusion rather than resistance. Many people simply didn't see what we wanted them to see. And that was because the pain rarely showed up as "our table abstraction is flawed." It actually showed up in much more practical, day-to-day complaints. Things like:

Our Spark job is slow
Listing partitions on S3 takes forever
We can't change partitioning without rewriting everything
Schema evolution broke our downstream job
Two jobs wrote to the same table and now we have corrupted data

These were treated as performance problems, operational mistakes, or scaling challenges. Teams would add scripts. They would add locks. They would document rules explaining how to safely evolve schemas or manage partitions. Over time, those workarounds hardened into architectural debt. One of the main reasons for these issues was that a "table" was never a first-class abstraction in data lakes. It was an agreement layered on top of the file system. You had directories containing Parquet files, and a Hive table pointing to that directory. Table semantics were effectively determined by the engine that interpreted them. When datasets were small, listing files from the file system was tolerable, and when only one engine wrote to the table, behavioral assumptions held. But as volumes increased and workloads diversified, the cracks began to show. For engineers working with cloud data warehouses, these issues didn't matter: the storage layer had always been abstracted by vendors with proprietary file and table formats, and maintenance was handled by the warehouse's own background services.
The most common problem for these users was data getting locked into vendor systems, and the very high costs that came with it. Iceberg's proposition was not just faster queries or shinier features. It was about formalizing the table format as an independent layer, with an open specification for all compute engines to abide by. This open specification ultimately became the solution to the problems we discussed.

The Early Struggles

If the technical argument behind Iceberg was subtle, the real challenge was cultural. We were not introducing a solution that replaced something obvious. We were introducing a new idea that required people to rethink their storage layers and the impact they have. Unfortunately, that kind of shift does not happen through feature highlights or benchmarks alone - it requires changing mental models. During that time, most engineers did not wake up thinking, "I need an open table format." Their goal was to ship pipelines, optimize jobs, reduce storage costs, or stabilize production workloads. The storage layer was not top of mind. So when we started speaking about formal specifications, snapshot isolation, and metadata trees, we were effectively asking people to care about the invisible foundation beneath their existing systems - the historically abstracted layer we mentioned before. We realized fairly quickly that education was going to be crucial for engineers to understand the need for something like Iceberg and the open lakehouse. We had to explain what a table abstraction actually means in a data lake, why relying on proprietary storage systems brings long-term constraints, and why modularity and openness were critical. From the perspective of Iceberg's technical architecture, we had to explain how the metadata tree structure brings snapshot isolation, how hidden partitioning works, and why concurrency control models are imperative to running multiple workloads on the same table. Note that these were not surface-level topics. They required long-form explanations, breaking down complex ideas into diagrams, and showing how specific use cases can be implemented with Iceberg. At conferences, the reception was thoughtful but measured. The questions were sometimes skeptical, but often practical: How does this compare to Hive? How is this different from Delta? Are we introducing another layer? Oftentimes, half the effort was simply clarifying what Iceberg was and what it was not. But those questions were important, as they forced us to sharpen the articulation of the value proposition.

How We Evangelized Iceberg

In the early days of Apache Iceberg adoption, formal evangelism around the project was almost entirely driven by the engineers building it. Outside of committers and contributors, there was little dedicated effort focused on education, storytelling, or community-building around Iceberg as a standalone technology. That began to change when Alex Merced joined Dremio in December 2021, followed by Dipankar Mazumdar in February 2022, as the first Developer Advocates with a primary focus on Apache Iceberg. What followed was an organic and formative period of experimentation. There was no established playbook for how to "evangelize" a table format. Instead, advocacy meant learning in public, translating deep technical concepts into practical guidance, and showing up consistently wherever the data community gathered. Once it became clear that this would be a long-term effort, the approach became deliberate.
If the table format abstraction was unfamiliar, we had to make it understandable. If the ecosystem lacked vocabulary, we had to build it. So, we organized our efforts around a few core pillars.

Foundational Blogs & Hands-on Exercises

The first pillar was long-form technical writing, deliberately targeted at explaining Iceberg's core architectural concepts, how it fundamentally worked, and how it compared with other formats at the time. We wrote deep technical blogs explaining how Iceberg looked under the hood and how read and write queries worked. We unpacked how hidden partitioning helped avoid accidental full-table scans. We went over Iceberg's key features and explained why the Puffin file format was introduced to carry the additional statistics that could help with performance improvements. But we quickly realized that reading alone was not enough - engineers needed to see and run things. So alongside the blogs, we built hands-on exercises. The accompanying repository became a practical companion to the writing - a place where readers could experiment with table creation, schema evolution, and partition changes, and observe the internal behavior themselves.

Webinars/Podcasts

The second pillar was live education. We ran webinars and podcasts focused on the mechanics of Iceberg. These provided space for live demos, deeper dives, and real-time Q&A, often revealing where understanding of Iceberg was still unclear. Early sessions were as much learning experiences for presenters as for attendees. Much of the time was spent dissecting real questions about how the system behaves under scale, concurrency, and multi-engine access. Over time, these sessions became less about explanation and more about application. Engineers began arriving with their own workload patterns and architectural constraints, asking how Iceberg would behave in their system rather than in a generic example. That evolution led us to introduce dedicated office hours. Instead of one-to-many presentations, we created open forums where engineers could bring specific production scenarios, performance issues, or how-to questions. The goal was to reduce friction for real adopters and make the abstraction practical, not theoretical. Those office hours became one of the most important feedback loops. They exposed edge cases, clarified documentation gaps, and often influenced how we explained Iceberg going forward.

Conferences & Ecosystem Conversations

Conferences played quite a different role than blogs or webinars. We were showing up in places where Iceberg (or table formats in general) was not yet the dominant topic. Our talks often required contextual framing before diving into technical details. We had to explain the problem space before explaining the solution. That meant sessions focused as much on open lakehouse architecture as on Iceberg itself. But conferences were not just educational moments - they were ecosystem checkpoints. These events brought together practitioners from different organizations who were solving similar problems in parallel. So, naturally, the conversations moved beyond "How does this feature work?" toward "How are you running this in production?" and "What are you doing for compaction at scale?" We were onto the next level: instead of debating whether the table abstraction was needed, engineers were comparing operational strategies. Conferences became a place where ecosystem alignment formed in public.
Different vendors, contributors, and adopters were discussing the same specification, the same semantics, and the same trade-offs. It reinforced the idea that Iceberg was not tied to a single company's roadmap but was evolving as an independent specification.

Books and Research Paper

By this point, we had already produced a significant body of written material - deep technical blogs, architectural breakdowns, hands-on guides. All of this was great to see, but we wanted to consolidate it into something more formal and durable. One outcome of that consolidation was publishing formal research work, including the paper "The Data Lakehouse: Data Warehousing and More". It was a way to capture the "why" behind the open lakehouse paradigm and compare it to traditional database systems like data warehouses. In parallel, there was also a clear need for something even more practitioner-oriented and complete: a definitive reference that engineers could keep on their desk. That's where the work on Apache Iceberg: The Definitive Guide came in. A book is quite different from everything else. It forces you to organize the subject end-to-end: table structure, metadata layers, table operations, performance patterns, and production practices.

Community Building & Public Collaboration

We want to stress that while the blogs, talks, and other educational materials helped explain Iceberg to the masses, the real inflection came from the diverse community that began forming around it. Iceberg was never positioned as a vendor-owned format, and its specification evolved in public. Committers from different organizations made proposals in the open. Integrations across engines and systems were built by contributors with different priorities and production realities. The mailing lists, Slack channels, and conference hallways became places where real-world lessons were exchanged. And our role in that process was to amplify these narratives and connect them. By consistently highlighting community contributions, helping users on the Iceberg Slack, inviting committers and adopters to speak at webinars and conferences, and creating spaces like office hours for open discussion, we tried to make participation visible and accessible. But the exchange was never one-directional. Those conversations sharpened our own understanding. For example, the office hours surfaced problems we hadn't considered, Slack discussions revealed documentation gaps, and talks by external adopters brought real production constraints into the spotlight. The more the community engaged, the more grounded and precise our messaging became.

Closing Reflections

Looking back, the early days of Apache Iceberg evangelism were defined by experimentation, curiosity, and a strong belief in open standards. Without a clear roadmap, advocacy evolved through blogs, webinars, events, books, and countless conversations with a data community trying to make sense of a rapidly changing data landscape. What began as a small effort to explain a new table format grew into sustained engagement with a global lakehouse community. If you are interested in continuing to learn more about Apache Iceberg from Dipankar and Alex, follow these channels:

Blogs - Cloudera Community
Cloudera Developers Playlist
The Dremio Blog

Below this post, you'll find a list of published works that Alex and Dipankar have been part of over the years. We encourage you to explore these writings, talks, and recordings to see how the ideas around Apache Iceberg and the lakehouse have evolved through that work.
Apache Iceberg: The Definitive Guide
Engineering Data Lakehouse with Open Table Formats
The Data Lakehouse Paper: Data Warehousing and More
Apache Polaris: The Definitive Guide
Architecting an Apache Iceberg Lakehouse
The Apache Iceberg Digest Vol. 1
The Book on Using Apache Iceberg with Python
The Book on Agentic Analytics
awesome-lakehouse-guide GitHub Repository
01-27-2026 04:58 PM
Open Lakehouse architecture brings the modularity and flexibility needed to run multiple analytical workloads on top of a single source of data, powered by open table formats like Apache Iceberg. Rather than relying on a tightly coupled, monolithic system, it allows teams to compose their data platform from independent building blocks - storage, table formats, catalogs, and compute engines - each selected based on specific workload and operational requirements. Storage is one of the key components of a lakehouse architecture. It is the foundation on which table formats implement transactional semantics, ACID guarantees, and metadata management, and enable multi-engine interoperability. Decisions made at the storage layer directly influence how reliably tables can evolve, scale, and be shared across systems.

Apache Iceberg: Database semantics on Object storage

Apache Iceberg implements ACID properties on top of object storage and provides a schema to refer to the data files (e.g. Apache Parquet) as a "table." At its core, Iceberg separates physical data storage (immutable Parquet/ORC/Avro files) from logical table state (schemas, partitions, snapshots, file-level statistics). This separation is fundamental to how Iceberg behaves and enables multiple compute engines to work together on the same table at the same time. Instead of relying on directory layouts or file naming conventions, Iceberg maintains explicit metadata files that describe:

Which data files belong to the table
How they are partitioned
Which snapshot represents the current table state
How the table evolved over time

As tables evolve, both metadata and data files grow in number, making the behavior and scalability of the underlying object storage a first-class concern for Iceberg deployments.
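Engines expose exactly this information as queryable "metadata tables." As a minimal sketch - assuming a SparkSession (spark) with the Iceberg runtime and a catalog named demo is already configured, and using a hypothetical table name demo.nyc.taxis - the metadata listed above can be inspected with plain SQL:

# Hedged sketch: Iceberg metadata tables queried through Spark.
# "demo" (catalog) and "nyc.taxis" (table) are placeholder names.
spark.sql("SELECT snapshot_id, operation, summary FROM demo.nyc.taxis.snapshots").show()   # how the table evolved over time
spark.sql("SELECT file_path, partition, record_count FROM demo.nyc.taxis.files").show()    # data files in the current snapshot and their partitions
spark.sql("SELECT path, added_data_files_count FROM demo.nyc.taxis.manifests").show()      # manifests referenced by the current snapshot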
Apache Ozone: Object Store for Lakehouse

Apache Ozone is designed with specific requirements in mind. It is a highly scalable, distributed object store built to support growing data volumes while also handling large numbers of smaller objects without exhausting metadata capacity. This makes it well suited for lakehouse systems where table formats continuously generate new data and metadata artifacts as tables evolve.

What does Apache Ozone bring to the "Table"?

Before diving into the hands-on workflow, it's worth briefly summarizing what Apache Ozone brings as the storage system.

Open-source, cloud-native object and file storage
Apache Ozone is an open-source, cloud-native storage system designed to scale to billions of objects and hundreds of petabytes. Its architecture is built for distributed deployments, making it suitable for large analytical platforms where storage growth is continuous and long-lived.

Dual-access semantics: object and filesystem APIs
Ozone supports native S3-compatible access for modern data platforms while also exposing traditional filesystem semantics (OFS). This dual-access model allows the same underlying data to be reached through either interface, enabling gradual migration and mixed workloads within a single deployment.

Strong consistency without a centralized metadata bottleneck
Ozone provides strong consistency guarantees while avoiding the traditional NameNode bottleneck by fully decoupling metadata management from storage. The metadata plane (Ozone Manager) and the storage plane (Storage Container Manager) operate independently and are coordinated using Apache Ratis, a Raft-based consensus protocol. This design enables scalable metadata operations without sacrificing correctness.

Proven at petabyte scale in production
Ozone has been validated in large-scale production environments, supporting petabyte-scale datasets and high object counts. This makes it a practical storage foundation for highly transactional systems.

These characteristics make Apache Ozone a strong fit as a general-purpose object store for lakehouse architectures. When paired with an open table format like Apache Iceberg, these same properties directly address the storage challenges that emerge as tables grow, evolve, and accumulate both data and metadata over time. Ozone addresses several core challenges that are common in Iceberg-based lakehouse deployments.

Scalable storage for data growth: Analytical datasets grow both in data size and in object count as tables are ingested, partitioned, rewritten, and optimized over time. Ozone is designed to scale both dimensions independently, distributing data across storage nodes while maintaining a consistent object namespace.
Efficient handling of small objects: Beyond raw data growth, Iceberg tables generate a large amount of metadata as part of normal table writes and evolution. Ozone is built to handle large volumes of small objects without saturating metadata capacity, which is critical in lakehouse systems where metadata growth is intrinsic.
Built-in durability, security, and availability: Ozone provides enterprise-grade storage features required in production lakehouse environments, including data encryption for security, erasure coding for storage efficiency, and replication for fault tolerance. These capabilities allow Ozone to serve as a durable system of record for both Iceberg data and metadata over long table lifecycles.
S3-compatible access: Ozone exposes a native Amazon S3-compatible API, allowing Iceberg and multiple compute engines to interact with tables using standard object storage interfaces. As a result, Iceberg tables stored in Ozone follow the same layout and semantics as they would on cloud object storage.

Exploring Apache Iceberg with Apache Ozone

With the architectural context in place, let's go through a hands-on exploration of Apache Iceberg tables with Apache Ozone as the storage system. Our goal here is not to understand the Iceberg APIs, but rather to see what the Ozone + Iceberg combination brings. All examples in this section are driven by a preconfigured Jupyter notebook available in the GitHub repository. You can run the notebook end-to-end to perform a sequence of table operations (create, write, evolve, update, delete). Rather than walking through each notebook cell, the sections below highlight what to observe in Apache Ozone as those operations execute. The notebook is preconfigured to use:

Apache Iceberg as the table format
An Iceberg REST catalog backed by object storage
Apache Ozone (via its S3-compatible interface) as the storage layer for both data and metadata
Apache Spark as the compute engine

Iceberg table layout materialized in Apache Ozone

Once you write to an Iceberg table, the first thing to observe is the creation of Iceberg's /metadata and /data directories in Ozone's file system (as seen below). You can inspect this layout directly using any S3-compatible client against Ozone's S3 Gateway:

aws s3api --endpoint-url http://s3.ozone:9878 \
  list-objects \
  --bucket warehouse \
  --prefix nyc/taxis/metadata/

aws s3api --endpoint-url http://s3.ozone:9878 \
  list-objects \
  --bucket warehouse \
  --prefix nyc/taxis/data/

As you progress through the notebook - writing data, evolving schemas, and performing updates or deletes - you will notice that the number of objects under metadata/ continues to grow. From a storage perspective, this means that metadata growth is intrinsic. Even modest tables can accumulate a large number of small metadata objects over time, so it is worth keeping an eye on.
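One simple way to track that growth is to count and size the objects under the metadata/ prefix after each notebook step. A minimal sketch, assuming the same Ozone S3 Gateway endpoint, bucket, and prefix used above, and that S3 credentials for the gateway are available in the environment:

import boto3

# Point boto3 at Ozone's S3 Gateway (endpoint, bucket, and prefix mirror the
# notebook layout above; credentials are assumed to come from the environment).
s3 = boto3.client("s3", endpoint_url="http://s3.ozone:9878")

paginator = s3.get_paginator("list_objects_v2")
count = 0
total_bytes = 0
for page in paginator.paginate(Bucket="warehouse", Prefix="nyc/taxis/metadata/"):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

print(f"metadata objects: {count}, total size: {total_bytes} bytes")

Re-running this after each table operation makes the incremental metadata growth described above easy to see.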
Observing Iceberg activity through Apache Ozone management UIs

One advantage of using Iceberg with Apache Ozone is the ability to observe storage behavior directly through Ozone's management interfaces. Ozone internally separates metadata management, physical storage placement, and observability into distinct services. Understanding these components helps explain why Ozone is a strong storage foundation for open table formats like Iceberg.

Ozone Recon: cluster and object-level visibility

Ozone Recon is an observability and insight service that provides a consolidated view of cluster state, usage, and health. Recon is not involved in the read or write path, but it plays an important role in operating and debugging Ozone-backed lakehouse deployments. Recon provides:

Cluster-level metrics and capacity insights
Object (key) counts and namespace usage
Container and pipeline health information
Diagnostic views useful for troubleshooting storage-related issues

When running Iceberg workloads, Recon allows operators to correlate table activity with storage behavior. As Iceberg generates new data and metadata objects over time, Recon makes it possible to observe how object counts, container usage, and cluster health evolve. In the Overview view, we can observe how the storage layer behaves as Iceberg operations are executed. In this run, the cluster remains healthy with 1 active datanode and 2 healthy containers. At this point, Recon reports 5 keys in the namespace, reflecting the data and metadata objects created by the Iceberg table. Despite ongoing object creation, container health and pipeline status remain stable, indicating that the workload does not introduce stress or instability at the storage layer. From a capacity perspective, the cluster shows ~40.5 GB used out of 452.1 GB available (roughly 9-10% utilization), with Iceberg-related data accounting for a small but growing portion of overall usage. This highlights an important aspect of Iceberg workloads: storage growth happens incrementally and continuously as tables evolve. Recon's Insights view adds another layer of visibility into this behavior. The file size distribution reveals a mix of object sizes produced by the Iceberg workload, with multiple objects in the 8 KiB-16 KiB range alongside smaller objects in the 2 KiB-4 KiB range. This pattern reflects Iceberg's operational model, where relatively small metadata and manifest files are created alongside larger Parquet data files.

Ozone Manager (OM): metadata plane health

The Ozone Manager is the master service responsible for managing Ozone's object namespace and metadata. This includes volumes, buckets, keys, and the metadata required to map objects to their underlying storage blocks. OM's responsibilities are strictly at the metadata and namespace level:

Tracking object metadata (keys) created via the S3-compatible interface
Maintaining a consistent view of volumes and buckets
Coordinating metadata updates using Apache Ratis (Raft) to provide high availability and consistency

Since Iceberg implicitly relies on the storage system to durably and consistently persist object metadata as new files are written during table evolution, OM ensures that object metadata updates are replicated and consistently visible across the cluster.

Conclusion

This blog highlights what Apache Ozone brings to an Iceberg-based lakehouse without requiring any special integration. As Iceberg operations execute, Ozone consistently acts as a durable system of record for both data and metadata objects. It absorbs continuous object creation driven by table evolution, scales object count independently of data size, and maintains a stable namespace as tables grow and change over time. Equally important is visibility. Through Ozone Manager and Recon, object growth, namespace health, container placement, and cluster state can be inspected and correlated directly with table activity. This makes it easier to reason about, operate, and debug metadata-intensive lakehouse workloads. In practice, what Apache Ozone brings is a storage foundation that aligns with the demands of modern table formats: scalable object storage, consistent metadata management, and first-class observability. When those properties are present, formats like Apache Iceberg can focus entirely on table-level semantics - while Apache Ozone reliably handles everything underneath.
12-04-2025 09:32 AM
Apache Iceberg began as an internal project at Netflix to solve a very real problem: analytical data lakes built on file formats like Apache Parquet and ORC lacked reliable schema evolution, atomicity, and visibility into table state. To address these challenges, the team defined a specification that could enforce consistency and transactional guarantees at scale - introducing Iceberg as the "table format for slow-moving data." If we take a step back, most early "big data" workloads relied on the Apache Hive table format stored in the Hadoop Distributed File System (HDFS). Over time, the Hive table format began to show its limitations - especially at Netflix's scale and as workloads moved to cloud object stores like Amazon S3. One of the key design challenges was that Hive relied on directory structures to define tables, where each partition mapped to a folder and its files represented the data. This design worked well on HDFS, where directory listings were fast and consistent. But on S3, listing millions of files across nested partitions became slow and costly. Also, since S3 stores data as independent objects in a flat structure, it can throttle requests when too many target the same prefix. In a typical Hive table layout, where partitions share a common prefix, concurrent listings or writes can easily hit S3 request-rate limits, causing throttling and 5xx errors. And there were other limitations - concurrency, corrupted table states, stale statistics, and so on - that needed a fresh focus. These motivated the development of Iceberg with some clear design goals:

Ensure table correctness and consistency, even with concurrent writes.
Enable faster query planning and execution without full directory scans.
Allow users to ignore the physical layout of files and focus on logical data.
Support schema and table evolution safely.
Accomplish all of this at cloud scale.

Iceberg's answer was to redefine what a "table" means on a data lake. Instead of treating all files under a directory as the table, Iceberg introduced the concept of a canonical list of data files, tracked through metadata, manifests, and snapshots. Each commit represents a complete, immutable view of the table at a point in time, enabling atomic operations, time travel, and isolation without relying on brittle directory structures. This design not only addressed the core limitations of Hive-style tables but also laid the foundation for the open lakehouse architecture. By separating the logical definition of a table from its physical layout, Iceberg made it possible for multiple compute engines to operate on the same dataset with full transactional guarantees. Iceberg's open, versioned table specification ensures that these capabilities are not tied to any single engine or vendor, allowing consistent behavior and interoperability across diverse systems. From there, Iceberg evolved through successive specifications (v1, v2, v3, and the upcoming v4), each extending its capabilities to support new data types, row-level operations, and other innovations that developers now rely on in production-grade open lakehouses.

Evolution of the Apache Iceberg Table Format Specification

With that bit of history, let's look at how Iceberg's specification has evolved over time and what engineers and developers can look forward to in the upcoming one.

Spec v1: How Iceberg Enabled Analytic Tables at Scale

The first specification, v1, defined how to manage large analytical tables on immutable file formats like Parquet, Avro, and ORC.
Its core goal was to bring database-like guarantees (ACID) and schema evolution to the data lake, while staying agnostic to the underlying compute engine. v1 introduced a multi-layered metadata architecture:

Data files stored the actual table content.
Manifest files listed groups of data files and their statistics (min/max values, record counts, partition information).
A manifest list tracked all manifests that together defined the current table snapshot.
A metadata file pointed to the latest snapshot and table-level properties such as schema, partition spec, and snapshot lineage.

This hierarchy made every commit an immutable snapshot, so readers could rely on a consistent view while writers performed atomic updates by swapping in a new metadata file reference. It solved a critical problem that Hive couldn't: snapshot isolation on object storage. Each table version could be reconstructed precisely using metadata alone, enabling time travel and rollback operations without duplicating data. Spec v1 also formalized schema evolution. Fields could be added, renamed, or deleted safely without rewriting existing data files, since schema and column IDs were tracked independently from file structure. This was a foundational change - decoupling physical data layout from logical schema made Iceberg a durable layer over immutable files. Through releases up to v0.11.0, this specification became the basis for large-scale analytical workloads.

Spec v2: How Iceberg Supports Row-Level Deletes and Incremental Mutations

By the time Iceberg reached v0.11.1, the community had voted to approve Spec v2 - a major step forward. While v1 worked well for immutable, append-only workloads, real-world production systems needed efficient row-level deletes and updates to handle CDC (Change Data Capture), GDPR deletions, and late-arriving data corrections. Spec v2 introduced delete files, which encode rows to be deleted or replaced within existing data files. This design enabled a new table-write pattern known as Merge-on-Read (MoR). Instead of rewriting entire data files, as done in the Copy-on-Write (CoW) pattern, MoR tables simply append new delete files and let readers reconcile them at query time. The merge process produces a consistent view by applying deletes over the underlying data files dynamically. Two types of delete files implement this behavior:

Position deletes - mark rows for removal by their physical position in a data file
Equality deletes - mark rows based on column values (for example, delete all rows where id = 123)

This mechanism drastically reduced write amplification, making row-level updates practical on immutable storage - a foundational capability for streaming ingestion and near-real-time pipelines. Another subtle but important change in v2 was stricter writer guarantees. While atomic commits and snapshot isolation, which form the foundation of Iceberg's optimistic concurrency control (OCC) model, had existed since v1, Spec v2 reinforced this model by formalizing writer-side validation semantics. Writers were now required to validate their parent snapshot lineage during commit, strengthening the OCC-based transaction model and ensuring consistent behavior for concurrent row-level operations across engines. By v1.4.0, Iceberg adopted format-version = 2 as the default for new tables.
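As a concrete illustration, here is a minimal Spark SQL sketch of a v2-style table and a row-level delete. The catalog and table names (demo.db.events) are hypothetical, a SparkSession with the Iceberg Spark runtime, SQL extensions, and a catalog named demo is assumed to be configured already, and the write-mode properties shown are one way to opt a table into merge-on-read behavior:

from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime and SQL extensions are configured and a
# catalog named "demo" exists; all names below are illustrative only.
spark = SparkSession.builder.appName("iceberg-v2-sketch").getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.events (
    id BIGINT,
    payload STRING,
    ts TIMESTAMP
  )
  USING iceberg
  TBLPROPERTIES (
    'format-version' = '2',
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
  )
""")

# A row-level delete like this can be served by writing small delete files
# instead of rewriting the affected data files.
spark.sql("DELETE FROM demo.db.events WHERE id = 123")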
At this point, Iceberg had evolved from a read-optimized analytical format into a mutable table layer capable of low-latency upserts, streaming merges, and transactional safety across distributed writers.

Spec v3: Extended Types, Metadata Enhancements & Row Lineage

As workloads diversified, the community began tackling a broader set of problems - handling semi-structured and geospatial data, adding lineage tracking, and improving deletion efficiency. These efforts culminated in Spec v3, reflected in Iceberg releases v1.8.0 to v1.10.0 (2025). The first wave of v3 features appeared in v1.8.0 (February 2025), introducing capabilities like:

Binary Deletion Vectors (DVs): Store row-level delete information as compact binary bitmaps, removing the need for separate delete files and enabling faster merges.
Variant Type: A flexible column type for semi-structured, JSON-like data, allowing ingestion of untyped data without strict schema enforcement.
New Geospatial Types: Support for geometry and geography data to power location analytics and mapping use cases.
Nanosecond Precision Timestamps (with/without TZ): For event and telemetry workloads demanding precise temporal resolution.

Later, v1.9.0 (April 2025) and v1.10.0 (September 2025) expanded v3's depth:

Row Lineage Tracking: Metadata fields that allow engines to detect row-level changes between commits, which simplifies incremental processing.
Default Values & Multi-Argument Transforms: Allow specifying default column values for schema evolution and defining partitioning or sorting transforms that use multiple input columns.
Table Encryption Keys: Built-in encryption primitives to secure data at rest, complementing external key management systems.
Spec Clarifications: Write requirements to prevent orphaned deletion vectors and ensure consistent reader behavior across engines.

Why Do These Innovations Matter for Developers?

The v3 spec opened new possibilities for how developers build and operate data pipelines. Many of the additions in this spec directly address pain points that emerge once systems reach scale and workloads increase. Let's look at a few real-world use cases where these new innovations add value.

Incremental & CDC Processing

Until recently, building incremental pipelines over immutable data was complicated and costly. Iceberg v3's row lineage capabilities, through internal _row_id and _sequence_number metadata, give each row a persistent identity across commits. Engines can now detect exactly which rows changed between snapshots. For developers, this makes CDC ingestion, materialized view refreshes, and downstream incremental transformations far simpler. Instead of rescanning entire partitions or maintaining complex diff logic, pipelines can consume only what changed, enabling faster, cheaper, and more reliable refresh cycles.
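To give a rough feel for what "consume only what changed" looks like from an engine today, here is a hedged sketch using Iceberg's incremental append scan in Spark between two snapshots (the table name and snapshot IDs are placeholders, the SparkSession from the earlier sketch is assumed, and richer change queries built on v3 row lineage depend on engine support):

# Hypothetical snapshot IDs taken from the table's history.
start_snapshot_id = "1111111111111111111"
end_snapshot_id = "2222222222222222222"

# Read only the rows appended between the two snapshots of demo.db.events.
changes = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", start_snapshot_id)
    .option("end-snapshot-id", end_snapshot_id)
    .load("demo.db.events")
)
changes.show()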
Efficient Row-Level Deletes and Read Performance

Spec v2 enabled row-level deletes through external delete files, which reduced write amplification but created operational overhead. Many engines didn't compact these files automatically, and in production tables they often accumulated over time, forcing readers to merge dozens of delete files during scans. Spec v3 introduces binary deletion vectors (DVs), allowing developers to handle row-level deletes efficiently without relying on manual compaction jobs. DVs store deleted-row markers as compact bitmaps linked to data files, simplifying maintenance, reducing read-time merging, and improving performance for mutation-heavy workloads such as CDC ingestion or streaming upserts.

Dealing with Diverse & Evolving Data

Modern pipelines increasingly mix structured, semi-structured, and geospatial data. Before Iceberg v3, developers had to flatten JSON into many nullable columns or store it as strings, which is inefficient for filtering and query pushdown. Spec v3 introduces a VARIANT type that encodes semi-structured data natively, allowing engines to query nested fields without full scans or heavy parsing. It also adds GEOMETRY and GEOGRAPHY types for spatial data, enabling efficient spatial joins and region-based filtering directly on coordinates or shapes. Together with default column values, these extensions make schema evolution safer and allow diverse data types to be stored and queried efficiently within the same table.

Optimizing Partitioning and Query Planning

Earlier specs of Iceberg restricted partition transforms to a single column. For example, a bucket(16, user_id) transform could hash only one input field. In some cases, though, developers needed composite bucketing, like distributing data by a combination of columns (country, city) to improve data locality and query pruning. Spec v3 extends the partition specification to support multi-argument transforms, particularly for the bucket transform. This allows developers to define composite partition keys natively instead of encoding them into a single field. By enabling hashing and partitioning across multiple columns, query engines can plan scans more precisely and reduce shuffle during joins or aggregations. The result is better data clustering and lower compute overhead for large analytical workloads.

Building Secure and Governed Tables

Finally, Spec v3 introduces table-level encryption keys, giving organizations a consistent way to secure Iceberg tables at rest and manage key rotation policies. This is particularly relevant in multi-tenant or regulated environments where governance and compliance are critical.

Spec v4: What's Next for Developers Building with Iceberg

Spec v4 is still in planning and proposals are being made, but the direction is clear. A lot of the work is focused on tightening the format around real pain points developers hit once tables get big, commits get frequent, and metadata becomes the bottleneck. Let's take a look at four of these proposals that have momentum in the Iceberg community.

Single-File Commit

Every Iceberg commit today involves writing multiple metadata layers - a new metadata.json, a manifest list, and one or more manifest files. This structure introduces unnecessary overhead for small or frequent writes. Even a small update requires multiple metadata rewrites, delete operations often trigger full manifest rewrites (with CoW), and caching manifests across commits becomes difficult since files are frequently replaced. Spec v4 proposes a simpler model built around a Root Manifest, which replaces the manifest list and acts as the single entry point for each snapshot. The hierarchy collapses into a clean two-level structure:

Root Manifest -> Data Manifests / Delete Manifests / Files

Each commit now modifies only what changed, keeping metadata growth proportional to the size of the operation rather than the size of the table. The benefits are immediate for developers: faster commits and fewer metadata rewrites.
Query planning also improves, as the Root Manifest can aggregate file-level metrics from its children, allowing pruning to happen earlier. Together, these changes make Iceberg better suited for streaming and micro-batch workloads where commits are small but frequent.

Storing Metadata in Parquet

Since its early versions, Iceberg has stored metadata files in Apache Avro. That choice worked well when manifests were small and queries read them as whole records. But as tables have grown, with hundreds of columns and thousands of file-level statistics, reading entire manifest rows has become expensive. Query engines often need just a subset of fields (for example, file paths and a single column's min/max), but Avro forces them to deserialize the entire record. The community is now proposing to transition metadata files to a columnar format using Apache Parquet. This change allows engines to read only the necessary columns, improving planning efficiency and memory usage. It also aligns metadata storage with data storage, unlocking optimizations like column pruning and predicate pushdown even for metadata queries. In combination with the new single-file commit model, this ensures that query planning remains fast even as metadata becomes richer and more expressive.

Column Statistics Rework: Making Stats First-Class

Another proposal focuses on redesigning how column statistics are represented. Currently, stats for each column - such as lower and upper bounds, null counts, and value sizes - are stored as a generic map from field IDs to values. While functional, this approach creates several problems: it's inefficient for wide tables, loses type information during serialization, and makes it hard to project only specific stats. The new proposal introduces a typed, structured representation of column stats. Each field's statistics will be stored with preserved logical and physical types, making them more reliable through schema evolution. Engines will be able to read individual stats (for example, just the lower bounds for a few columns) without loading everything into memory. This change also makes statistics extensible: developers will be able to attach richer per-field metrics for emerging data types like VARIANT or GEOMETRY, and query engines can use them for smarter pruning. In practice, this means more predictable performance on wide, evolving schemas, and better planning efficiency for mixed workloads.

Relative Paths

This addresses a long-standing operational issue: Iceberg stores all file paths as absolute URIs. This becomes challenging when you need to move a table - between buckets, regions, or even storage systems - because each embedded path must be rewritten. For replication, disaster recovery, or multi-region deployments, this has been cumbersome. Spec v4 proposes support for relative paths within table metadata. By storing references relative to the table root, Iceberg tables can be moved or copied without rewriting metadata. The internal relationships between data and metadata files remain consistent, and absolute paths can still be used where needed for external data. This makes replication, backup, and migration simpler.

Community Drives Innovation

The pace of Iceberg's evolution is largely a reflection of its community. Every new capability has emerged from real operational challenges faced by developers across organizations. Users and developers from companies like Netflix, Apple, Dremio, Tabular, Snowflake, Cloudera, and many others contribute code, specifications, and design reviews.
Each change is first proposed as a public discussion or design document, iterated on by the community, and only then voted into the specification. This process ensures that new features address genuine production problems, rather than being driven by any single engine or vendor. It also creates a rapid feedback loop - as developers deploy Iceberg at scale, they bring real-world lessons back into design discussions. That's why the features in recent specs - e.g. deletion vectors, root-level manifests, and columnar metadata - can be traced directly to patterns seen in high-throughput streaming and batch environments. Iceberg has matured from solving basic data management challenges on the data lake to defining how transactional workloads operate in open data environments. Each specification has extended the boundaries of what a table format can do - adding the kind of features developers expect from databases and warehouses into the lakehouse architecture, while offering openness and interoperability. The result is a format that now underpins a wide spectrum of workloads: batch analytics, incremental processing, streaming ingestion, and AI pipelines.

Cloudera's Commitment to the Iceberg Journey

Cloudera introduced native support for Apache Iceberg in its public cloud Lakehouse platform in 2021, extending it to on-premises deployments the following year. Since then, Iceberg has become central to how customers build modern data architectures across hybrid and multi-cloud environments. Today, petabytes of data are managed in Iceberg tables on Cloudera, powering everything from near-real-time analytics and regulatory compliance workloads to AI data preparation and large-scale data engineering pipelines. Alongside this, the newly launched Cloudera Lakehouse Optimizer automates table maintenance operations that would otherwise require manual tuning. It continuously manages small-file compaction, manifest rewriting, and layout optimization, improving query performance and reducing storage costs. For engineers, this means less operational overhead - no babysitting tables, no manual compaction or cleanup - while maintaining the same consistency guarantees across all Iceberg-compatible engines. By aligning open table formats with enterprise-grade governance, hybrid deployment flexibility, and automated optimization, Cloudera's platform aims to make it simpler for developers and data teams to adopt Iceberg confidently across environments. Cloudera also contributes to the Apache Iceberg community through ongoing code contributions, community initiatives such as meetups, and developer-focused educational resources. Join our Community!