Community Announcements


CommunityOverCode Asia 2024 will be held in Hangzhou from July 26 to 28, 2024

Community Manager

CommunityOverCode (formerly known as ApacheCon) is the official global conference series of the Apache Software Foundation (ASF). Since 1998, before the ASF’s incorporation, ApacheCon has been drawing participants at all levels to explore “Tomorrow’s Technology Today” across 300+ Apache projects and their diverse communities. CommunityOverCode showcases the latest developments in Apache projects and emerging innovations through hands-on sessions, keynotes, real-world case studies, trainings, hackathons, and more.

CommunityOverCode showcases the latest breakthroughs from ubiquitous Apache projects and upcoming innovations in the Apache Incubator, as well as open source development and leading community-driven projects the Apache way. Attendees learn about core open source technologies independent of business interests, corporate biases, or sales pitches.

The CommunityOverCode program is dynamic and evolving at each event, with content directly driven by select Apache project developer and user communities. CommunityOverCode delivers state-of-the-art content that features the latest open source advances in big data, cloud, community development, FinTech, IoT, machine learning, messaging, programming, search, security, servers, streaming, web frameworks, and more in a collaborative, vendor-neutral environment.


(DISCLAIMER: © 2021-2024 The Apache Software Foundation under the terms of the Apache License 2.0. Apache, the Apache feather logo, and the Apache logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.)

Cloudera’s commitment to open source includes over 200 chairs, committers, and contributors across more than 55 projects. This enables Cloudera users to benefit from open source software with open standards, open APIs, and an open ecosystem for integration into the broader data community, preventing vendor lock-in. Cloudera supports the diverse communities that contribute to each project, which increases the speed and scope of innovation and ensures collaboration.

Cloudera will present sessions on Apache Iceberg, Apache Ozone, Apache Impala, Apache Hive, and Apache NiFi.

Apache HBase on Ozone

Sammi Chen

ROOM 4: Fri, 2:30 pm–3:00 pm

Apache Ozone is the next-generation distributed storage system in the Hadoop ecosystem. It exposes both a Hadoop Compatible File System API and an S3-compatible API. As of the 1.4.0 release, Ozone smoothly supports Apache Spark, YARN, Hive, and Impala. Apache HBase is a Bigtable-like data store that plays a very important role in the Hadoop ecosystem. In this session, I will share how HBase-on-Ozone support is designed, the challenges we have met, and the latest status of the development.

Impala discovers Iceberg Metadata Tables

Daniel Becker

ROOM 5: Fri, 3:45 pm–4:15 pm

Storing extensive metadata is one of the key features of the Apache Iceberg table format, helping query engines plan and execute queries efficiently. As Iceberg provides an API to query this metadata, it can be presented in query engines as a set of virtual tables, which can be queried with SQL, including filtering, aggregation and joins with other metadata or even regular tables. This feature provides an invaluable table maintenance tool for database administrators.
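
For illustration, metadata tables can be queried much like regular tables; a sketch in Impala-style SQL (the table name `db.orders` and the specific columns shown are hypothetical examples):

```sql
-- List the snapshots of a (hypothetical) Iceberg table
SELECT snapshot_id, parent_id, operation, committed_at
FROM db.orders.snapshots
ORDER BY committed_at DESC;

-- Aggregate file-level metadata: files and rows per file format
SELECT file_format, COUNT(*) AS num_files, SUM(record_count) AS num_rows
FROM db.orders.files
GROUP BY file_format;
```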

During the past year, we have been working on making Iceberg metadata tables available in Apache Impala, a high-performance, distributed, massively parallel query engine. Query execution in Impala is implemented in C++, which presents some challenges as the Iceberg API is Java-based.

In addition, the format in which Iceberg returns data (as Java objects, accessed through JNI) also differs from the format in which Impala normally receives its input data from on-disk files, even for the same SQL data types. The difference is especially pronounced in the case of complex types (structs, arrays and maps), and this necessitated extra steps to incorporate them.

This talk will guide you through all the new Impala features related to Iceberg metadata tables and describe how we overcame the obstacles that arose during their implementation.

Row-level modifications at petabyte-scale via Impala on Iceberg

Péter Rózsa, Zoltán Borók-Nagy

ROOM 4: Fri, 4:15 pm–4:45 pm

Apache Impala is a distributed, massively parallel query engine for big data. Initially, it focused on fast query execution on top of large datasets that were ingested via long-running batch jobs. The table schema and the ingested data typically remained unchanged, and row-level modifications were impractical. Today's expectations for modern data warehouse engines have risen significantly; users now want RDBMS-like capabilities in their data warehouses, e.g., schema and partition evolution, time travel, and the focus of this talk: row-level modifications. Apache Iceberg is a cutting-edge table format that delivers these advanced write capabilities for large-scale data.

In this talk, you will learn how Apache Impala leverages Iceberg’s writing capabilities for large-scale data and how Impala implements data manipulation operations like DELETE, UPDATE, and MERGE. Join us for this session to discover how Impala has evolved to meet these emerging requirements.
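
As a hedged sketch of what these operations look like (the table names and columns are hypothetical; exact syntax support depends on the Impala release):

```sql
-- Row-level delete on an Iceberg table
DELETE FROM db.orders WHERE status = 'CANCELLED';

-- Row-level update
UPDATE db.orders SET status = 'SHIPPED' WHERE order_id = 42;

-- Merge changes from a staging table
MERGE INTO db.orders t USING db.order_updates s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.status);
```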

Impala 4.4: A More Intelligent Query Engine

Quanlong Huang, Manish Maheshwari

ROOM 3: Fri, 5:15 pm–5:45 pm

Apache Impala is a native query engine built on an MPP architecture for open data and table formats. In this session, we will share updates from the Impala community over the past year, including core features of the upcoming 4.4 release such as Workload Aware Auto-Scaling, Calcite integration, JDBC federation, codegen caching, intermediate results caching, and the query history table.

Simplifying Iceberg Table Lifecycle Management: A Comprehensive Approach

Yan Liu, Bill Zhang

ROOM 4: Fri, 5:15 pm–5:45 pm

Apache Iceberg tables have emerged as a powerful solution for managing large-scale data lakes, offering features like ACID-compliant transactions, efficient data organization, and schema evolution. However, as data lakes grow in complexity, so do the challenges in managing the lifecycle of Iceberg tables effectively.

This session will delve into strategies for simplifying the management of Iceberg tables throughout their lifecycle. We will explore techniques for optimizing table creation, time travel, compaction, schema evolution, data retention, and archival processes. Additionally, we will discuss best practices for monitoring and maintaining the health and performance of Iceberg tables.
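
A few of these lifecycle operations can be sketched in Impala-style SQL (the table name `db.events` is hypothetical; exact syntax depends on the engine and release):

```sql
-- Table creation
CREATE TABLE db.events (id BIGINT, ts TIMESTAMP) STORED AS ICEBERG;

-- Time travel to an earlier state of the table
SELECT * FROM db.events FOR SYSTEM_TIME AS OF '2024-07-01 00:00:00';

-- Data retention: expire snapshots older than a given point in time
ALTER TABLE db.events EXECUTE expire_snapshots('2024-07-01 00:00:00');

-- Schema evolution: add a column without rewriting data
ALTER TABLE db.events ADD COLUMNS (source STRING);
```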

Let’s see how fast Impala runs on Iceberg

Zoltán Borók-Nagy

ROOM 8: Fri, 5:15 pm–5:45 pm

Apache Impala is a distributed, massively parallel query engine designed for high-performance querying of large-scale data. There has been a long list of new features recently around supporting Apache Iceberg tables, such as reading, writing, time travel, and so on. However, in a big data environment, it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables. Other engines might simply use the Iceberg library to perform reads, while Impala has a C++ implementation optimized for speed.

Nowadays, big data storage systems must offer not only the ability to store data but also to alter and delete it at the row level. Apache Iceberg solves this by using delete files that live alongside the data files; it is then the query engine's responsibility to apply the delete files to the data files when querying the data. To efficiently read such tables, we implemented new Iceberg-specific operators in Impala.
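
Conceptually, applying position delete files is an anti-join of (data file, row position) pairs against the data rows. A minimal Python sketch of the idea, with in-memory lists standing in for Parquet data and delete files (this illustrates the concept only, not Impala's C++ operators):

```python
def apply_position_deletes(data_rows, position_deletes):
    """Filter out rows marked deleted by Iceberg-style position delete files.

    data_rows: list of (file_path, row_position, row_value) tuples,
               standing in for rows read from data files.
    position_deletes: iterable of (file_path, row_position) pairs,
               standing in for the contents of position delete files.
    """
    deleted = set(position_deletes)  # hash the delete entries once
    # Anti-join: keep only rows whose (file, position) is not deleted.
    return [value for (path, pos, value) in data_rows
            if (path, pos) not in deleted]

# Example: rows from two data files, one position delete entry
data = [("f1.parquet", 0, "a"), ("f1.parquet", 1, "b"), ("f2.parquet", 0, "c")]
deletes = [("f1.parquet", 1)]
print(apply_position_deletes(data, deletes))  # ['a', 'c']
```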

In this talk, we will go into the implementation details and reveal what is the secret behind Impala’s great performance in general and also when reading Iceberg tables with position delete files. We will also show some measurements where we compare Impala’s performance with other open-source query engines.

By the end of this talk, you should have a high-level understanding of Impala and Iceberg’s architecture, the performance tricks we implemented in Impala specifically for Iceberg, and how Impala competes with other engines.

Overview of tools, techniques and tips: Scaling Ozone performance to max out CPU, Network and Disk

Ritesh Shukla, Tanvi Penumudy

ROOM 8: Fri, 4:15 pm–4:45 pm

Over the past year, significant advancements have been made in enhancing the performance of Ozone, a distributed object store designed to scale on commodity hardware and capable of handling billions of files. These enhancements not only ensure that Ozone can saturate the network, provided the drives are sufficiently fast, but also improve the performance from a single-thread perspective during both read and write operations. This talk delves into the various tools and techniques employed to achieve these performance gains, highlighting key optimizations at the system level.

We will cover topics including (but not limited to):

  1. Serialization and Deserialization: Impact and optimizations.
  2. Impact of Zero Copy Buffers: This is particularly beneficial for large data transfers, reducing the impact on system performance and enhancing overall efficiency.
  3. Concurrency Improvements: Enhancements in the concurrency model over Ratis, which implements the Raft consensus algorithm, have led to better utilization of system resources, reducing bottlenecks and improving response times.
  4. Core System Level Optimizations: Numerous other system-level optimizations have been implemented, including more efficient memory management, optimized I/O paths, and enhanced error handling mechanisms.
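
The zero-copy idea in point 2 can be illustrated in a few lines of Python (Ozone itself is implemented in Java; this only demonstrates the general principle that a view over a buffer avoids the cost of duplicating it):

```python
data = bytearray(b"x" * (1 << 20))   # 1 MiB buffer

copy_slice = bytes(data[:512])       # copies 512 bytes out of the buffer
view_slice = memoryview(data)[:512]  # zero-copy view into the same memory

# Mutating the underlying buffer is visible through the view (no copy was
# made), but not through the independent bytes copy taken earlier.
data[0] = ord("y")
assert view_slice[0] == ord("y")
assert copy_slice[0] == ord("x")
```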

Using Metrics and Grafana Dashboards for Performance Monitoring

One of the key aspects of our performance enhancement strategy involves the use of metrics and Grafana dashboards. These tools allow us to monitor performance across the cluster effectively and gain insights that help in further optimization.

Stitching Performance Across the Cluster: By integrating metrics from various nodes into a centralized Grafana dashboard, we can view the entire cluster's performance at a glance. This holistic view is instrumental in identifying bottlenecks and uneven load distribution across the cluster.

Designing Effective Dashboards: Effective dashboard design is crucial for efficient monitoring. Our focus has been on identifying and displaying metrics that provide real-time insights into the system’s performance, helping us make informed decisions quickly.

Choosing the Right Metrics: Selecting the appropriate metrics to monitor is critical. We prioritize metrics that reflect the system’s health and performance, such as latency, throughput, error rates, and resource utilization.

Advanced Diagnostic Techniques

Further, we employ advanced diagnostic techniques to pinpoint performance issues and optimize system behavior.

  1. Using Flamegraphs: Flamegraphs are invaluable for identifying hotspots and understanding the behavior of our system under different load conditions. They provide a visual stack trace of process execution, which helps in isolating performance issues and optimizing code paths.

Collaboration and Innovation within the Hadoop Ecosystem

A unique advantage of our approach lies in our collaboration with the Hadoop ecosystem, particularly with teams like HBase. This collaboration has fostered innovation that benefits both Ozone and HBase:

  1. Integration with HBase: Our work with the HBase team has led to better integration between Ozone and HBase, allowing HBase to leverage the scalability and robustness of Ozone for storage management. This synergy has not only improved performance but also enhanced the capabilities of both systems.
  2. Innovating Up the Stack: The close collaboration within the Hadoop ecosystem enables us to innovate "up the stack." By understanding and optimizing the interaction between different layers of the stack, we can introduce improvements that significantly enhance overall system performance and efficiency.

Navigating the Lakehouse with Confidence: Best Practices for Implementation with Apache Iceberg

Bill Zhang

ROOM 5: Sat, 3:00 pm–3:30 pm

The convergence of data lakes and data warehouses into a unified architecture, known as the Lakehouse paradigm, has gained significant traction in the data engineering community. Apache Iceberg has emerged as a cornerstone technology for implementing Lakehouse architectures, providing robust features for managing large-scale, transactional data lakes efficiently.

In this session, we will explore best practices for implementing a Lakehouse architecture using Apache Iceberg. Through real-world examples and practical insights, attendees will learn how to design, deploy, and optimize a Lakehouse solution that leverages the strengths of Iceberg for data management, reliability, and performance.

Recognize, Reconcile, and Repeat: The Path to Uniform Replicas in Apache Ozone

Ethan Rose, Ritesh Shukla

ROOM 4: Sat, 4:45 pm–5:15 pm

Faults in distributed systems are often unbounded. Apache Ozone needs to be resilient to faults and preserve the durability and consistency of data in the face of them. Reconciliation is a new feature being developed that allows Ozone to recover from faults regardless of how they occurred or how much time passed between them. Reconciliation allows peers to identify, report, and resolve discrepancies of any type, and it works across data durability models such as replication as well as erasure coding. The system can initiate and complete this process without administrative intervention. This talk will cover the motivation behind replica reconciliation, the general strategy, which can be applied to any storage system, and the implementation details that make it possible.
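
A heavily simplified sketch of the general strategy (not Ozone's actual implementation; the data structures and the majority-vote repair rule here are hypothetical illustrations): peers exchange lightweight checksums to recognize divergence, then repair differing blocks from their peers.

```python
import hashlib

def replica_checksum(blocks):
    """Lightweight fingerprint of a replica: hash over its sorted block hashes."""
    h = hashlib.sha256()
    for block_id in sorted(blocks):
        h.update(block_id.encode())
        h.update(hashlib.sha256(blocks[block_id]).digest())
    return h.hexdigest()

def reconcile(replicas):
    """Repair replicas in place until their checksums agree.

    replicas: list of dicts mapping block_id -> block bytes. Missing or
    divergent blocks are repaired by majority vote among the peers.
    """
    all_ids = set().union(*(r.keys() for r in replicas))
    for block_id in all_ids:
        # Majority vote picks the authoritative copy of each block.
        candidates = [r[block_id] for r in replicas if block_id in r]
        authoritative = max(set(candidates), key=candidates.count)
        for r in replicas:
            r[block_id] = authoritative
    return replicas

r1 = {"b1": b"data1", "b2": b"data2"}
r2 = {"b1": b"data1"}                      # replica missing block b2
r3 = {"b1": b"data1", "b2": b"data2"}
reconcile([r1, r2, r3])
assert replica_checksum(r1) == replica_checksum(r2) == replica_checksum(r3)
```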

Apache NiFi and MiNiFi Use Cases for IoT

Yan Liu

ROOM 6: Sun, 5:15 pm–5:45 pm

This session introduces (and reintroduces) Apache NiFi and MiNiFi for various IoT data collection and data routing use cases, with technical deep dives.

For more information, please visit: https://asia.communityovercode.org/
