How We Helped Make Apache Iceberg Cool - Before the Hype!

Cloudera Employee


This blog was written in collaboration with Alex Merced, Head of DevRel at Dremio.

There was a time when talking about “open table formats” would get you a polite nod… and then the conversation would move on. This was, of course, before “lakehouse” became a strategy slide in every vendor or enterprise deck, and before modular data architectures were a mainstream discussion. Back then, the idea that the storage format layer should be open felt abstract and theoretical. Maybe even irrelevant in the greater scope of things!

We were told, more than once:

“Why does this matter?”
“Isn’t Parquet enough?”
“We already have a Data Warehouse & Data Lake”
“Just use Delta”

And to be fair, those weren’t unreasonable questions at the time. Apache Iceberg wasn’t a new query engine. It wasn’t a new database or a shiny AI model. In technical terms, it was essentially metadata and an open table format specification. Unfortunately, those things have never been in vogue - until users started seeing what they truly mean!

The Problem Nobody Thought Was a Problem

One of the interesting things about the early Iceberg conversations is that most people didn’t believe there was a structural issue in the first place. On one hand, cloud data warehouses were widely adopted and became the centralized repository for structured BI workloads; on the other hand, data lakes with Apache Parquet as the file format and Apache Hive as the table format became the standard for serving AI use cases. Databricks Delta Lake (the proprietary version of the Delta Lake table format) was also in use by customers who were already in the Databricks (or Azure Databricks) ecosystem. From the outside, things seemed fine.

So when we began advocating for an open table format with a formal specification, and spoke about things like snapshot isolation, structured metadata trees, and partition evolution, the reaction was often confusion rather than resistance. Many people simply didn’t see what we wanted them to see. And that was because the pain rarely showed up as “our table abstraction is flawed.”

It actually showed up in much more practical, day-to-day complaints. Things like:

  • Our Spark job is slow
  • Listing partitions on S3 takes forever
  • We can’t change partitioning without rewriting everything
  • Schema evolution broke our downstream job
  • Two jobs wrote to the same table and now we have corrupted data

These were treated as performance problems, operational mistakes, or scaling challenges. Teams would add scripts. They would add locks. They would document rules explaining how to safely evolve schemas or manage partitions. Over time, those workarounds hardened into architectural debt.

One of the main reasons for these issues was that a “table” was never a first-class abstraction in data lakes. It was an agreement layered on top of the file system. You had directories containing Parquet files, and a Hive table pointing to that directory. Table semantics were effectively determined by the engine that interpreted them. When datasets were small, listing files from the file system was tolerable, and when only one engine wrote to the table, behavioral assumptions held. But as volumes increased and workloads diversified, the cracks began to show.
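To make that concrete, here is a simplified sketch (the paths below are hypothetical) of the two models. In the Hive-style layout, the “table” is whatever files happen to sit under a directory prefix, and every engine has to list that prefix to plan a query; with Iceberg, the catalog points at a metadata file that tracks snapshots, manifests, and data files, so table state no longer depends on directory listings or on any single engine’s conventions.

A Hive-style table (an agreement about a directory):

  s3://warehouse/events/dt=2024-01-01/part-00000.parquet
  s3://warehouse/events/dt=2024-01-01/part-00001.parquet
  s3://warehouse/events/dt=2024-01-02/part-00000.parquet

An Iceberg table (catalog -> metadata tree -> data files):

  s3://warehouse/events/metadata/v3.metadata.json
  s3://warehouse/events/metadata/snap-<id>.avro        (manifest list)
  s3://warehouse/events/metadata/<manifest>.avro
  s3://warehouse/events/data/.../part-00000.parquet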

For engineers working with cloud data warehouses, these issues didn’t matter: the storage layer had always been abstracted away by vendors with proprietary file and table formats, and maintenance was handled by the warehouse’s own housekeeping. The most common problem for these users was different - data getting locked into vendor systems, and the steep costs that came with it.

Iceberg’s proposition was not just about faster queries or shinier features. It was about formalizing the table format as an independent layer with an open specification for all compute engines to abide by. This open specification ultimately became the solution to the problems described above.

The Early Struggles

If the technical argument behind Iceberg was subtle, the real challenge was cultural. We were not introducing a solution that replaced something obvious. We were introducing a new idea that required people to rethink their storage layers and the impact they have. Unfortunately, that kind of shift does not happen through just feature highlights or benchmarks - it requires changing mental models.

During that time, most engineers did not wake up thinking, “I need an open table format.” Their goal was to ship pipelines, optimize jobs, reduce storage costs, or stabilize production workloads. The storage layer was not top of mind. So when we started speaking about formal specifications, snapshot isolation, and metadata trees, we were effectively asking people to care about the invisible foundation beneath their existing systems - the historically abstracted stuff that we mentioned before. 

We realized fairly quickly that education was going to be crucial if engineers were to understand the need for something like Iceberg and the open lakehouse. We had to explain what a table abstraction actually means in a data lake, why relying on proprietary storage systems brings long-term constraints, and why modularity and openness were critical. From the perspective of Iceberg’s technical architecture, we had to explain how the metadata tree structure enables snapshot isolation, how hidden partitioning works, and why concurrency control models are imperative to running multiple workloads on the same table. These were not surface-level topics. They required long-form explanations, breaking complex ideas down into diagrams, and showing how specific use cases could be implemented with Iceberg.
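For a sense of what those explanations involved, here is a minimal Spark SQL sketch, using a hypothetical catalog ("demo") and table, of hidden partitioning and the snapshot metadata that underpins isolation and time travel:

from pyspark.sql import SparkSession

# A minimal sketch. Assumes a Spark session already configured with an
# Iceberg catalog named "demo" (catalog configuration omitted for brevity).
spark = SparkSession.builder.appName("iceberg-hidden-partitioning").getOrCreate()

# Hidden partitioning: the table is partitioned by a transform of event_ts,
# not by a separate partition column that writers and readers must manage.
spark.sql("""
    CREATE TABLE demo.db.events (
        id       BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Readers filter on the natural column; Iceberg's metadata prunes the
# irrelevant partitions without the query mentioning any partition column.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()

# Every commit creates a new snapshot in the metadata tree; this is what
# gives concurrent readers a consistent view and enables time travel.
spark.sql("SELECT committed_at, snapshot_id, operation FROM demo.db.events.snapshots").show()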

At conferences, the reception was thoughtful but measured. The questions were sometimes skeptical, but often practical: How does this compare to Hive? How is this different from Delta? Are we introducing another layer? Oftentimes, half the effort was simply clarifying what Iceberg was and what it was not. But those questions were important, as they forced us to sharpen the articulation of the value proposition.

How We Evangelized Iceberg

In the early days of Apache Iceberg adoption, formal evangelism around the project was almost entirely driven by the engineers building it. Outside of committers and contributors, there was little dedicated effort focused on education, storytelling, or community-building around Iceberg as a standalone technology.

That began to change when Alex Merced joined Dremio in December 2021, followed by Dipankar Mazumdar in February 2022, as the first Developer Advocates with a primary focus on Apache Iceberg.

What followed was an organic and formative period of experimentation. There was no established playbook for how to “evangelize” a table format. Instead, advocacy meant learning in public, translating deep technical concepts into practical guidance, and showing up consistently wherever the data community gathered.

Once it became clear that this would be a long-term effort, the approach became deliberate. If the table format abstraction was unfamiliar, we had to make it understandable. If the ecosystem lacked vocabulary, we had to build it.

So, we organized our efforts around a few core pillars.

Foundational Blogs & Hands-on Exercises

The first pillar was long-form technical writing, deliberately targeted at explaining Iceberg’s core architectural concepts, how it fundamentally worked, and how it compared with other formats at the time. We wrote deep technical blogs explaining how Iceberg looked under the hood and how read and write queries worked. We unpacked how hidden partitioning helped avoid accidental full-table scans. We went over Iceberg’s key features and why the Puffin file format was introduced to carry additional statistics that can help with performance. But we quickly realized that reading alone was not enough - engineers needed to see and run things. So alongside the blogs, we built hands-on exercises. That repository became a practical companion to the writing - a place where readers could experiment with table creation, schema evolution, and partition changes, and understand the internal behavior themselves. A sketch of those kinds of steps follows below.
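To give a flavor of those exercises, here is a minimal sketch, assuming the same hypothetical Iceberg-enabled Spark session and demo.db.events table as in the earlier example (and the Iceberg Spark SQL extensions enabled), of the kind of schema evolution and partition evolution steps a reader could run:

from pyspark.sql import SparkSession

# A minimal sketch, reusing the hypothetical demo.db.events table from above.
spark = SparkSession.builder.appName("iceberg-evolution").getOrCreate()

# Schema evolution: adding a column is a metadata-only change; no existing
# data files are rewritten, and old rows simply read the new column as null.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Partition evolution: move from daily to hourly partitioning going forward.
# Existing data keeps its old layout; new writes use the new spec, and no
# full table rewrite is required. (Needs the Iceberg SQL extensions.)
spark.sql("""
    ALTER TABLE demo.db.events
    REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts)
""")

# Metadata tables make the evolution visible: current partitions and the
# table's commit history can be queried like any other table.
spark.sql("SELECT * FROM demo.db.events.partitions").show()
spark.sql("SELECT * FROM demo.db.events.history").show()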

Webinars/Podcasts

The second pillar was live education. We ran webinars/podcasts focused on the mechanics of Iceberg. These provided space for live demos, deeper dives, and real-time Q&A, often revealing where understanding of Iceberg was still unclear. Early sessions were as much learning experiences for presenters as for attendees. Much of the time was spent dissecting real questions about how the system behaves under scale, concurrency, and multi-engine access.

Over time, these sessions became less about explanation and more about application. Engineers began arriving with their own workload patterns and architectural constraints, asking how Iceberg would behave in their system rather than in a generic example.

That evolution led us to introduce dedicated office hours. Instead of one-to-many presentations, we created open forums where engineers could bring specific production scenarios, performance issues, or how-to questions. The goal was to reduce friction for real adopters and make the abstraction practical, not theoretical.

Those office hours became one of the most important feedback loops. They exposed edge cases, clarified documentation gaps, and often influenced how we explained Iceberg going forward.

Conferences & Ecosystem Conversations

Conferences played quite a different role than blogs or webinars. We were showing up in places where Iceberg (or table formats in general) was not yet the dominant topic. Our talks often required contextual framing before diving into technical details. We had to explain the problem space before explaining the solution. That meant sessions focused as much on open lakehouse architecture as on Iceberg itself.

But conferences were not just educational moments - they were ecosystem checkpoints. These events brought together practitioners from different organizations who were solving similar problems in parallel. Naturally, the conversations moved beyond “How does this feature work?” toward “How are you running this in production?” and “What are you doing for compaction at scale?” We were on to the next level: instead of debating whether the table abstraction was needed, engineers were comparing operational strategies.

Conferences actually became a place where ecosystem alignment formed in public. Different vendors, contributors, and adopters were discussing the same specification, the same semantics, and the same trade-offs. It reinforced the idea that Iceberg was not tied to a single company’s roadmap but was evolving as an independent specification.

Books and Research Paper

By this point, we had already produced a significant body of written material - deep technical blogs, architectural breakdowns, hands-on guides. All of this was great to see, but we wanted to consolidate it into something more formal and durable.

One outcome of that consolidation was publishing formal research work, including the paper “The Data Lakehouse: Data Warehousing and More”. It was a way to capture the “why” behind the open lakehouse paradigm and compare it to traditional database systems like data warehouses.

In parallel, there was also a clear need for something even more practitioner-oriented and complete: a definitive reference that engineers could keep on their desk. That’s where the work on Apache Iceberg: The Definitive Guide came in. A book is quite different from everything else. It forces you to organize the subject end-to-end: table structure, metadata layers, table operations, performance patterns, and production practices.

Community Building & Public Collaboration

We want to stress that while the blogs, talks, and other educational materials helped explain Iceberg to the masses, the real inflection came from the diverse community that began forming around it.

Iceberg was never positioned as a vendor-owned format, and its specification evolved in public. Committers from different organizations made proposals in the open. Integrations across engines & systems were built by contributors with different priorities and production realities. The mailing lists, Slack channels, and conference hallways became places where real-world lessons were exchanged. And our role in that process was to amplify these narratives and connect them.

By consistently highlighting community contributions, helping users on Iceberg Slack, inviting committers and adopters to speak in webinars and conferences, and creating spaces like office hours for open discussion, we tried to make participation visible and accessible. But the exchange was never one-directional.

Those conversations sharpened our own understanding. For example, office hours surfaced problems we hadn’t considered, Slack discussions revealed documentation gaps, and talks by external adopters brought real production constraints into the spotlight. The more the community engaged, the more grounded and precise our messaging became.

Closing Reflections

Looking back, the early days of Apache Iceberg evangelism were defined by experimentation, curiosity, and a strong belief in open standards. Without a clear roadmap, advocacy evolved through blogs, webinars, events, books, and countless conversations with the data community trying to make sense of a rapidly changing data landscape. What began as a small effort to explain a new table format grew into sustained engagement with a global lakehouse community.

If you are interested in continuing to learn more about Apache Iceberg from Dipankar and Alex, follow these channels:

Below this post, you’ll find a list of published works that Alex and Dipankar have been part of over the years. We encourage you to explore these writings, talks, and recordings to see how the ideas around Apache Iceberg and the lakehouse have evolved through that work.