Developer Blogs

dipankartnt · ‎03-02-2026

A few days back we co-hosted a Lakehouse meetup in New York City with Cloudflare and LanceDB, bringing together folks from the Apache Iceberg, Apache DataFusion, and Lance communities.

At Cloudera, we see hosting and supporting open source meetups as a critical driver for the ecosystem, and we have been doing it for a while now. Open source projects like Iceberg, DataFusion, and Lance evolve through real-world feedback, and those conversations happen best when practitioners are in the same room. These meetups create that space, where design decisions are debated, trade-offs are weighed, and implementation realities are discussed openly. This event was a good example of that!

We had three dedicated talks targeting these three projects and opened up the rest of the time for community networking.

Iceberg Spec Evolution - v1 to v4 & How Cloudera Supports It

I kicked off the evening with a session on Apache Iceberg’s specification evolution. We walked through how specs v1 and v2 addressed the foundational table abstraction and row-level operations, then spent more time on v3 and the ongoing v4 proposals. Topics included:

Binary Deletion Vectors: Moving beyond positional deletes to binary bitmaps. This is a game-changer for write-heavy workloads, significantly reducing the I/O overhead of row-level updates.
Row-Level Lineage: The introduction of stable row identifiers in v3, which finally makes true Change Data Capture (CDC) and incremental processing feel native to the table format.
The V4 Horizon: We touched on the active proposals for v4, including single-file commits to simplify metadata writes and tackle metadata bloat, and using Parquet for metadata (replacing Avro) to allow columnar metadata reads. The latter would let engines skip even more data by loading only the specific metadata fields they need for a query plan.
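To make the deletion vector idea concrete, here is a minimal sketch in Python. It is a toy model, not Iceberg's implementation: a plain integer stands in for the compressed bitmap a real table format would persist, and the class and function names (`DeletionVector`, `scan`) are made up for illustration. It shows the core idea only: one bit per row position, so a reader filters rows with a bit test and two sets of deletes combine with a single OR.

```python
# Toy deletion vector: one bit per row position in a single data file.
# Assumption: a Python int stands in for the compressed bitmap that a
# real table format would store on disk. All names are hypothetical.


class DeletionVector:
    def __init__(self) -> None:
        self._bits = 0  # bit i set => row at position i is deleted

    def delete(self, position: int) -> None:
        """Mark the row at `position` as deleted."""
        self._bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        return bool((self._bits >> position) & 1)

    def merge(self, other: "DeletionVector") -> "DeletionVector":
        """Combine two vectors for the same file with a bitwise OR."""
        merged = DeletionVector()
        merged._bits = self._bits | other._bits
        return merged

    def cardinality(self) -> int:
        """Number of deleted rows."""
        return bin(self._bits).count("1")


def scan(rows, dv):
    """Return only the rows whose position is not marked deleted."""
    return [row for pos, row in enumerate(rows) if not dv.is_deleted(pos)]


rows = ["a", "b", "c", "d", "e"]

first_commit = DeletionVector()
first_commit.delete(1)  # delete "b"

second_commit = DeletionVector()
second_commit.delete(3)  # delete "d"

combined = first_commit.merge(second_commit)
live_rows = scan(rows, combined)
```

The data file itself is never rewritten here; only the small bitmap changes, which is where the I/O savings for row-level updates come from.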

The focus was on understanding why each change was introduced and how it affects developers building lakehouse pipelines at scale. A lot of the discussion centered on practical implications: metadata growth, mutation patterns, execution behavior, and how these changes surface in real deployments.

We also touched on how Cloudera has supported Iceberg’s core capabilities from early on - and what it means to support spec evolution inside a production platform. We have navigated the transition from the early v1 specs, and now, as the community pushes into the v3/v4 frontier, our focus remains on making those powerful new capabilities, like deletion vectors and row-level lineage, available in the platform.

The questions from the room reflected that people are actively thinking about these new opportunities and about Iceberg’s adoption in data platforms.

Cloudflare’s Data Platform: Iceberg + DataFusion

Jonathan Chen from Cloudflare then introduced their new data platform built on R2, R2 Data Catalog, R2 SQL, and Pipelines. The architecture combines object storage, Apache Iceberg as the table layer, and Apache DataFusion as the query engine - enabling ingestion and SQL analytics directly over object storage.

Jonathan explained how the R2 SQL engine (built on Apache DataFusion) uses a scatter-gather architecture to run analytics directly on Iceberg tables stored in R2. The engine can now handle aggregations (SUM, COUNT, etc.) and complex JOINs without the data ever leaving the Cloudflare network. It’s a compelling look at a "no-infra" future where the storage itself is smart enough to answer your SQL queries.
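For readers new to the scatter-gather pattern, the following Python sketch shows the general shape for a grouped SUM and COUNT. It is an illustration of the pattern only and says nothing about how R2 SQL or DataFusion implement it: the in-memory `DATA_FILES`, the thread pool standing in for distributed workers, and all function names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data files, each a list of (country, bytes_served) rows.
# In a real engine these would be data files in object storage.
DATA_FILES = {
    "file-0": [("US", 10), ("DE", 5)],
    "file-1": [("US", 7), ("IN", 3)],
    "file-2": [("DE", 1), ("IN", 4), ("US", 2)],
}


def partial_aggregate(rows):
    """Scatter step: per-group (SUM, COUNT) for a single file."""
    partial = {}
    for key, value in rows:
        total, count = partial.get(key, (0, 0))
        partial[key] = (total + value, count + 1)
    return partial


def merge_partials(partials):
    """Gather step: combine the partial states into the final result."""
    final = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            prev_total, prev_count = final.get(key, (0, 0))
            final[key] = (prev_total + total, prev_count + count)
    return final


def scatter_gather(files):
    # A thread pool stands in for remote workers in this sketch.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_aggregate, files.values()))
    return merge_partials(partials)


result = scatter_gather(DATA_FILES)
```

SUM and COUNT work well here because their partial states merge by simple addition, so each worker only ships back a small per-group summary rather than raw rows.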

Multimodal AI Lakehouse with Lance

Chang She (Co-founder & CEO of LanceDB) brought a different flavor to the conversation: the multimodal challenge. While Iceberg handles our analytical tables, how do we handle the billions of embeddings and blobs required for modern AI?

The next wave of AI (think Midjourney, WorldLabs, and Runway) requires seamless, scalable access to much more than just numbers and strings. We’re talking about text, images, embeddings, and complex modalities. Chang introduced Lance, a columnar data format optimized specifically for AI, and LanceDB, the multimodal lakehouse built on top of it. Another highlight was the new "branch/tag" capability. It essentially functions as "Git for AI Data," allowing data scientists to create zero-copy clones of massive datasets for experimentation. This means you can branch a production dataset, run a fine-tuning job or a transformation experiment, and then either merge or discard it.
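One way to picture why a branch can be "zero-copy" is the toy model below. It is not the Lance or LanceDB API: `ToyDataset` and its methods are hypothetical, and the fast-forward merge is an assumption, since the talk recap does not describe merge semantics. The idea it sketches is that a dataset version is just an immutable list of fragment references, so a branch is a new pointer to the same list and no data is duplicated.

```python
# Toy model of zero-copy branching. A dataset version is an immutable
# tuple of fragment ids; a branch is a named pointer to such a tuple.
# All names are hypothetical and not part of any real library.


class ToyDataset:
    def __init__(self, fragments):
        self.branches = {"main": tuple(fragments)}
        self.tags = {}

    def create_branch(self, name, source="main"):
        # Copies only the pointer, never the underlying fragments.
        self.branches[name] = self.branches[source]

    def append(self, branch, fragment):
        # Builds a new manifest that reuses all existing fragment ids.
        self.branches[branch] = self.branches[branch] + (fragment,)

    def tag(self, name, branch):
        # A tag is a fixed label for the branch's current version.
        self.tags[name] = self.branches[branch]

    def fast_forward(self, target, source):
        # Assumed merge strategy: point `target` at `source`'s version.
        self.branches[target] = self.branches[source]

    def drop_branch(self, name):
        del self.branches[name]


ds = ToyDataset(["frag-0", "frag-1"])
ds.tag("v1", "main")

ds.create_branch("experiment")
shares_manifest = ds.branches["experiment"] is ds.branches["main"]

ds.append("experiment", "frag-2-finetune")
main_after_experiment = ds.branches["main"]

ds.drop_branch("experiment")  # discard the experiment; main is untouched
```

Because the experiment only ever adds a new manifest, discarding it leaves the production branch exactly as it was.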

Final Thoughts

If you have been to a great meetup, you know the schedule is only half the story. After the talks, a lot of folks came up to continue the conversation - especially around what’s coming next in Apache Iceberg and how to think about the upcoming spec direction. There were also good side discussions sparked by the Lance session, particularly around multimodal workloads and how teams are starting to think about vectors, text, and images alongside tabular lakehouse tables.

This is exactly why we at Cloudera Community care about these meetups. The value isn’t only the presentations - it’s the direct, candid conversations that happen around them, across projects and across communities.

It goes without saying that an event like this doesn't just happen. This meetup has been a long time coming, growing from early discussions between Prashanth, ChanChan, Jonathan, and me about the need for a dedicated data infra meetup in the city.

A massive thanks to the Cloudflare team for providing such a great space, and a huge shoutout to Cloudera, LanceDB, and Cloudflare for their support in making this a reality. Join us at Cloudera Community to keep track of all the meetups and events.

Cloudera Open Source Meetup Series: Bringing the Open Lakehouse Community Together in NYC
