
Introducing Apache Airflow 2 in Cloudera Data Engineering for end-to-end data pipeline orchestration

Cloudera Employee

As a key component of Cloudera Data Engineering (CDE), Apache Airflow serves as a flexible orchestration service that data engineers and practitioners use to develop and operationalize end-to-end data pipelines. Today, many customers use the managed Airflow service to avoid the administrative overhead of maintaining and tuning their own full-fledged scheduler. Instead, they rely on the out-of-the-box security and autoscaling compute in CDE to deploy tens to hundreds of DAGs (Directed Acyclic Graphs) using CDE's job management APIs. And through integration with CDP data services, pipelines can tap into the efficient containerized compute of Spark in CDE and Hive in Cloudera Data Warehouse (CDW).

With Airflow 2.1 as the new default managed scheduler, customers can continue to rely on the low administrative overhead they have come to expect, while users reap the benefits of the latest developments in the upstream community. As with any major release, many aspects of Airflow have been enhanced, including a scheduler speedup of up to 17x, a better way to organize tasks through task groups, a full UI refresh, and a new way of writing DAGs using the TaskFlow API. Airflow 2.1 in CDE comes with governance, security, and compute autoscaling enabled out of the box, along with integration with CDE's job management APIs, giving users the flexibility to deploy custom DAGs that tap into Cloudera Data Platform (CDP) data services like Spark in CDE and Hive in CDW.