Apache Iceberg began as an internal project at Netflix to solve a very real problem: analytical data lakes built on file formats like Apache Parquet and ORC lacked reliable schema evolution, atomicity, and visibility into table state. To address these challenges, the team defined a specification that could enforce consistency and transactional guarantees at scale - introducing Iceberg as the “table format for slow-moving data.”
If we take a step back, most early “big data” workloads relied on the Apache Hive table format stored in the Hadoop Distributed File System (HDFS). Over time, the Hive table format began to show its limitations - especially at Netflix’s scale and as workloads moved to cloud object stores like Amazon S3. One of the key design challenges was that Hive relied on directory structures to define tables, where each partition mapped to a folder and its files represented the data.
This design worked well on HDFS, where directory listings were fast and consistent. But on S3, listing millions of files across nested partitions became slow and costly. Also, since S3 stores data as independent objects in a flat structure, it can throttle requests when too many target the same prefix. In a typical Hive table layout where partitions share a common prefix, concurrent listings or writes can easily hit S3 request-rate limits, causing throttling and 5xx errors. And there were other limitations with concurrency, corrupted table states, stale statistics, etc. that needed a fresh focus.
These challenges motivated the development of Iceberg with a set of clear design goals: bring database-like reliability to the data lake without tying tables to directory layouts or to any single compute engine.
Iceberg’s answer was to redefine what a “table” means on a data lake. Instead of treating all files under a directory as the table, Iceberg introduced the concept of a canonical list of data files, tracked through metadata, manifests and snapshots. Each commit represents a complete, immutable view of the table at a point in time, enabling atomic operations, time travel, and isolation without relying on brittle directory structures.
This design not only addressed the core limitations of Hive-style tables but also laid the foundation for the open lakehouse architecture. By separating the logical definition of a table from its physical layout, Iceberg made it possible for multiple compute engines to operate on the same dataset with full transactional guarantees. Iceberg’s open, versioned table specification ensures that these capabilities are not tied to any single engine or vendor, allowing consistent behavior and interoperability across diverse systems. From there, Iceberg evolved through successive specifications (v1, v2, v3, and the upcoming v4), each extending its capabilities to support new data types, row-level operations, and other innovations that developers now rely on in production-grade open lakehouses.
With that bit of history, let’s look at how Iceberg’s specification has evolved over time and what engineers and developers can look forward to in the upcoming one.
The first specification, v1, defined how to manage large analytical tables on immutable file formats like Parquet, Avro, and ORC. Its core goal was to bring database-like guarantees (ACID) and schema evolution to the data lake, while staying agnostic to the underlying compute engine. v1 introduced a multi-layered metadata architecture:
- A table metadata file that records the schema, partition spec, and the current snapshot
- A manifest list for each snapshot that enumerates the manifest files making up that snapshot
- Manifest files that track individual data files along with partition values and column-level statistics
This hierarchy made every commit an immutable snapshot. So, readers could rely on a consistent view while writers performed atomic updates using a new metadata file reference. It solved a critical problem that Hive couldn’t: snapshot isolation on object storage. Each table version could be reconstructed precisely using metadata alone, enabling time travel and rollback operations without duplicating data.
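To make this concrete, here is a minimal sketch of inspecting snapshots and time-traveling with Spark SQL. The catalog name, warehouse path, and table name (demo.db.events) are illustrative assumptions, and the sketch presumes the iceberg-spark-runtime package is available to Spark.

```python
from pyspark.sql import SparkSession

# Spark session wired to an Iceberg catalog; the catalog name, warehouse path,
# and table below are placeholders, and the iceberg-spark-runtime jar is
# assumed to be on the classpath (e.g. added via --packages).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Each commit is a snapshot; the snapshots metadata table lists them all.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots"
).show()

# Time travel to an earlier table state by timestamp or by snapshot id.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'"
).show()
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789").show()
```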
Spec v1 also formalized schema evolution. Fields could be added, renamed, or deleted safely without rewriting existing data files, since schema and column IDs were tracked independently from file structure. This was a foundational change - decoupling physical data layout from logical schema made Iceberg a durable layer over immutable files. Through releases up to v0.11.0, this specification became the basis for large-scale analytical workloads.
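A short sketch of what that looks like in practice, reusing the hypothetical demo.db.events table and the Iceberg-enabled Spark session from the previous example: these ALTER statements only touch metadata, never the underlying data files.

```python
from pyspark.sql import SparkSession

# Assumes a session already configured for Iceberg as in the previous sketch.
spark = SparkSession.builder.getOrCreate()

# Metadata-only schema changes: columns are tracked by ID, so no data files
# are rewritten by any of these statements.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN device_type STRING")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN device_type TO device")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN device")
```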
By the time Iceberg reached v0.11.1, the community had voted to approve Spec v2 - a major step forward. While v1 worked well for immutable, append-only workloads, real-world production systems needed efficient row-level deletes and updates to handle CDC (Change Data Capture), GDPR deletions, and late-arriving data corrections.
Spec v2 introduced delete files, which encode rows to be deleted or replaced within existing data files. This design enabled a new table-write pattern known as Merge-on-Read (MoR). Instead of rewriting entire data files, as done in the Copy-on-Write (CoW) pattern, MoR tables simply append new delete files and let readers reconcile them at query time. The merge process produces a consistent view by applying deletes over the underlying data files dynamically.
There are two types of delete files that implement this behavior:
- Position delete files, which identify deleted rows by data file path and row position within that file
- Equality delete files, which identify deleted rows by matching column values, such as a primary key
This mechanism drastically reduced write amplification, making row-level updates practical on immutable storage - a foundational capability for streaming ingestion and near–real-time pipelines.
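Here is a hedged sketch of enabling merge-on-read behavior on a v2 table with Spark SQL; the table names (demo.db.events, demo.db.events_updates) are illustrative of the pattern rather than a prescribed setup.

```python
from pyspark.sql import SparkSession

# Assumes a session configured for Iceberg as in the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Opt the table into format v2 and merge-on-read row-level operations.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# These now append delete files instead of rewriting whole data files;
# readers reconcile them at query time.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 42")
spark.sql("""
    MERGE INTO demo.db.events t
    USING demo.db.events_updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```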
Another subtle but important change in v2 was stricter writer guarantees. Atomic commits and snapshot isolation, which form the foundation of Iceberg’s optimistic concurrency control (OCC) model, had existed since v1, but Spec v2 reinforced this model by formalizing writer-side validation semantics. Writers were now required to validate their parent snapshot lineage during commit, strengthening the OCC-based transaction model and ensuring consistent behavior for concurrent row-level operations across engines.
By v1.4.0, Iceberg adopted format-version = 2 as the default for new tables. At this point, Iceberg had evolved from a read-optimized analytical format into a mutable table layer that is capable of low-latency upserts, streaming merges, and transactional safety across distributed writers.
As workloads diversified, the community began tackling a broader set of problems - handling semi-structured and geospatial data, adding lineage tracking, and improving deletion efficiency. These efforts culminated in Spec v3, reflected in Iceberg releases v1.8.0 to v1.10.0 (2025).
The first wave of v3 features appeared in v1.8.0 (February 2025), introducing capabilities such as deletion vectors and row lineage.
Later, v1.9.0 (April 2025) and v1.10.0 (September 2025) expanded v3’s depth, rounding out support for the capabilities discussed below.
The v3 spec opened new possibilities for how developers build and operate data pipelines. Many of the additions in this spec directly address pain points that emerge once systems reach scale and workloads increase. Let’s see a few real-world use cases where these new innovations add value.
Until recently, building incremental pipelines over immutable data was complicated and costly. Iceberg v3’s row lineage capabilities, through internal _row_id and _sequence_number metadata, give each row a persistent identity across commits. Engines can now detect exactly which rows changed between snapshots. For developers, this makes CDC ingestion, materialized view refreshes, and downstream incremental transformations far simpler. Instead of rescanning entire partitions or maintaining complex diff logic, pipelines can now consume only what changed, enabling faster, cheaper, and more reliable refresh cycles.
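For instance, an incremental read between two snapshots, the pattern that row lineage makes more precise, might look like the following sketch. The snapshot IDs and table name are placeholders, and a plain incremental read like this covers appended data.

```python
from pyspark.sql import SparkSession

# Assumes a session configured for Iceberg as in the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Read only the rows committed after one snapshot and up to another.
# Snapshot ids are placeholders; incremental reads like this cover appended
# data, while row lineage lets engines track updates and deletes as well.
changes = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "1111111111111111111")  # exclusive
    .option("end-snapshot-id", "2222222222222222222")    # inclusive
    .load("demo.db.events")
)
changes.show()
```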
Spec v2 enabled row-level deletes through external delete files, which reduced write amplification but created operational overhead. Many engines didn’t compact these files automatically, and in production tables they often accumulated over time, forcing readers to merge dozens of delete files during scans. Spec v3 introduces binary deletion vectors (DVs), allowing developers to handle row-level deletes efficiently without relying on manual compaction jobs. DVs store deleted-row markers as compact bitmaps linked to data files, simplifying maintenance, reducing read-time merging, and improving performance for mutation-heavy workloads such as CDC ingestion or streaming upserts.
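A hedged sketch of opting a table into spec v3 so that engines which support it can start writing deletion vectors; whether DVs are actually produced depends on the Iceberg and engine versions in use, and the table name is again a placeholder.

```python
from pyspark.sql import SparkSession

# Assumes a session configured for Iceberg as in the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Upgrade the table to format version 3 (supported only on recent releases).
spark.sql("ALTER TABLE demo.db.events SET TBLPROPERTIES ('format-version' = '3')")

# With merge-on-read in effect, a v3-capable engine records deleted rows as
# compact deletion-vector bitmaps instead of accumulating many delete files.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 42")
```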
Modern pipelines increasingly mix structured, semi-structured, and geospatial data. Before Iceberg v3, developers had to flatten JSON into many nullable columns or store it as strings, which was inefficient for both filtering and query pushdown. Spec v3 introduces a VARIANT type that encodes semi-structured data natively, allowing engines to query nested fields without full scans or heavy parsing. It also adds GEOMETRY and GEOGRAPHY types for spatial data, enabling efficient spatial joins and region-based filtering directly on coordinates or shapes. Together with default column values, these extensions make schema evolution safer and allow diverse data types to be stored and queried efficiently within the same table.
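As an illustrative sketch only, a VARIANT-typed v3 table could look roughly like this in Spark SQL; the table name, the field path, and the availability of VARIANT DDL and the variant_get function all depend on the engine build, so treat this as the shape of the feature rather than DDL that works everywhere.

```python
from pyspark.sql import SparkSession

# Assumes a session configured for Iceberg as in the earlier sketches, on an
# engine build that already exposes the v3 VARIANT type.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE demo.db.clickstream (
        event_id BIGINT,
        payload  VARIANT
    ) USING iceberg
    TBLPROPERTIES ('format-version' = '3')
""")

# Pull one nested field out of the semi-structured payload without
# parsing or flattening the whole document.
spark.sql("""
    SELECT variant_get(payload, '$.user.country', 'string') AS country
    FROM demo.db.clickstream
""").show()
```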
Earlier specs of Iceberg restricted partition transforms to a single column. For example, a bucket(16, user_id) transform could hash only one input field. In practice, developers often needed composite bucketing, such as distributing data by a combination of columns (country, city) to improve data locality and query pruning. Spec v3 extends the partition specification to support multi-argument transforms, particularly for the bucket transform. This allows developers to define composite partition keys natively instead of encoding them into a single field. By enabling hashing and partitioning across multiple columns, query engines can plan scans more precisely and reduce shuffle during joins or aggregations. The result is better data clustering and lower compute overhead for large analytical workloads.
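Purely as a sketch of the idea, a composite bucket partition might be declared like this; engine DDL support for multi-argument transforms is still rolling out, so the exact syntax may differ from what ships in your engine.

```python
from pyspark.sql import SparkSession

# Assumes a session configured for Iceberg as in the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Illustrative only: a composite bucket over (country, city) as enabled by
# v3 multi-argument transforms; DDL support varies by engine version.
spark.sql("""
    CREATE TABLE demo.db.orders (
        order_id BIGINT,
        country  STRING,
        city     STRING,
        amount   DECIMAL(10, 2)
    ) USING iceberg
    PARTITIONED BY (bucket(16, country, city))
    TBLPROPERTIES ('format-version' = '3')
""")
```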
Finally, Spec v3 introduces table-level encryption keys, giving organizations a consistent way to secure Iceberg tables at rest and manage key rotation policies. This is particularly relevant in multi-tenant or regulated environments where governance and compliance are critical.
Spec v4 is still in planning and proposals are being made, but the direction is clear. A lot of the work is focused on tightening the format around real pain points developers hit once tables get big, commits get frequent, and metadata becomes the bottleneck.
Let’s take a look at four of these proposals that have gained momentum within the Iceberg community.
Every Iceberg commit today involves writing multiple metadata layers - a new metadata.json, a manifest list, and one or more manifest files. This structure introduces unnecessary overhead for small or frequent writes. Even a small update requires multiple metadata rewrites, delete operations often trigger full manifest rewrites (with CoW), and caching manifests across commits becomes difficult since files are frequently replaced. Spec v4 proposes a simpler model built around a Root Manifest, which replaces the manifest list and acts as the single entry point for each snapshot. The hierarchy collapses into a clean two-level structure:
Root Manifest -> Data Manifests / Delete Manifests / Files
Each commit now modifies only what changed, keeping metadata growth proportional to the size of the operation rather than the size of the table. The benefits are immediate for developers: faster commits and fewer metadata rewrites. Query planning also improves, as the Root Manifest can aggregate file-level metrics from its children, allowing pruning to happen earlier. Together, these changes make Iceberg better suited for streaming and micro-batch workloads where commits are small but frequent.
Since its early versions, Iceberg has stored metadata files in Apache Avro. That choice worked well when manifests were small and queries read them as whole records. But as tables have grown with hundreds of columns and thousands of file-level statistics, reading entire manifest rows has become expensive. Query engines often need just a subset of fields (for example, file paths and a single column’s min/max), but Avro forces them to deserialize the entire record.
The community is now proposing transitioning metadata files to a columnar format with Apache Parquet. This change allows engines to read only the necessary columns, improving planning efficiency and memory usage. It also aligns metadata storage with data storage, unlocking optimizations like column pruning and predicate pushdown even for metadata queries. In combination with the new single-file commit model, this ensures that query planning remains fast even as metadata becomes richer and more expressive.
Another proposal focuses on redesigning how column statistics are represented. Currently, stats for each column, such as lower and upper bounds, null counts, and value sizes are stored as a generic map from field IDs to values. While functional, this approach creates several problems: it’s inefficient for wide tables, loses type information during serialization, and makes it hard to project only specific stats.
This new proposal introduces a typed, structured representation of column stats. Each field’s statistics will be stored with preserved logical and physical types, making them more reliable through schema evolution. Engines will be able to read individual stats (for example, just the lower bounds for a few columns) without loading everything into memory. This change also makes statistics extensible: developers will be able to attach richer per-field metrics for emerging data types like VARIANT or GEOMETRY, and query engines can use them for smarter pruning. In practice, this means more predictable performance on wide, evolving schemas, and better planning efficiency for mixed workloads.
This has been a long-standing operational issue - Iceberg stores all file paths as absolute URIs. Moving a table between buckets, regions, or even storage systems therefore requires rewriting every embedded path. For replication, disaster recovery, or multi-region deployments, this has been cumbersome.
Spec v4 proposes support for relative paths within table metadata. By storing references relative to the table root, Iceberg tables can be moved or copied without rewriting metadata. The internal relationships between data and metadata files remain consistent, and absolute paths can still be used where needed for external data. This makes replication, backup, and migration simpler.
The pace of Iceberg’s evolution is largely a reflection of its community. Every new capability has emerged from real operational challenges faced by developers across organizations. Users and developers from companies like Netflix, Apple, Dremio, Tabular, Snowflake, Cloudera, and many others contribute code, specifications, and design reviews. Each change is first proposed as a public discussion or design document, iterated on by the community, and only then voted into the specification.
This process ensures that new features address genuine production problems, rather than being driven by any single engine or vendor. It also creates a rapid feedback loop - as developers deploy Iceberg at scale, they bring back real-world lessons into design discussions. That’s why the features in recent specs - deletion vectors, root-level manifests, and columnar metadata - can be traced directly to patterns seen in high-throughput streaming and batch environments.
Iceberg has matured from solving basic data management challenges on the data lake to defining how transactional workloads operate in open data environments. Each specification has extended the boundaries of what a table format can do - adding the kind of features developers expect from databases and warehouses into the lakehouse architecture, while offering openness and interoperability. The result is a format that now underpins a wide spectrum of workloads: batch analytics, incremental processing, streaming ingestion, and AI pipelines.
Cloudera introduced native support for Apache Iceberg in its public cloud Lakehouse platform in 2021, extending it to on-premises deployments the following year. Since then, Iceberg has become central to how customers build modern data architectures across hybrid and multi-cloud environments. Today, petabytes of data are managed in Iceberg tables on Cloudera, powering everything from near real-time analytics and regulatory compliance workloads to AI data preparation and large-scale data engineering pipelines.
Alongside this, the newly launched Cloudera Lakehouse Optimizer automates table maintenance operations that would otherwise require manual tuning. It continuously manages small file compaction, manifest rewriting, and layout optimization, improving query performance and reducing storage costs. For engineers, this means less operational overhead - no babysitting tables and no manual compaction or cleanup - while maintaining the same consistency guarantees across all Iceberg-compatible engines.
By aligning open table formats with enterprise-grade governance, hybrid deployment flexibility, and automated optimization, Cloudera's platform aims to make it simpler for developers and data teams to adopt Iceberg confidently across environments.
Cloudera also contributes to the Apache Iceberg community through ongoing code contributions, community initiatives such as meetups, and developer-focused educational resources. Join our Community!