Apache Airflow is a popular platform for scheduling and monitoring workflows. In machine learning, scheduled workflows are useful for automating data acquisition and for building and monitoring models.

 

This tutorial will take you through the process of setting up Airflow as an Application in CML. This can be very useful for building out test and prototype pipelines. It also provides a mechanism for calling chains of Models or Jobs.

 

  1. Create a new Python project in CML using a blank template (see the New Project dialog).
  2. Open up a Workbench session and install Airflow; this can be scripted using the install instructions here.
  3. The following Python and shell scripts can be used to automate this process; the installation needs to be performed only once:
    install.py:
    import os

    # run the one-time Airflow installation script
    os.system("./install.sh")

    install.sh:
    #!/bin/bash -x
    
    export AIRFLOW_HOME=~/airflow
    
    # install from pypi using pip
    pip3 install apache-airflow
    
    # initialize the database
    airflow initdb
  4. This installs the Airflow components into your project's file system, where they persist for the lifetime of the project.
  5. Now that the software is installed, the next step is to start the processes as a long-lived application. This requires a Python file and a shell script to start the services.
    start.py:
    import os

    # launch the Airflow web server and scheduler
    os.system("./start.sh")

    start.sh:
    #!/bin/bash -x

    # use the port CML assigns to the Application, falling back to 8090
    PORT=${CDSW_APP_PORT:-8090}

    export AIRFLOW_HOME=~/airflow

    # start the web server in the background so the scheduler can also start
    airflow webserver -p $PORT -hn 127.0.0.1 &

    # start the scheduler (kept in the foreground so the Application stays alive)
    airflow scheduler

    # open the Application in CML to reach the Airflow UI and enable the example DAGs on the home page
  6. The PORT variable is read from the CDSW_APP_PORT environment variable that CML sets for the Application, and the web server is started on localhost. These scripts can be tested in a Workbench session, or added to an Application with the following settings (see the Create Application dialog).
  7. Validate that the application starts up cleanly by checking the application's logs (see Monitor Logs).
  8. Opening the application loads directly into the Airflow management UI (see Access Airflow UX).
  9. Using the Airflow HTTP Operator, it is possible to call CML Jobs directly via the CML Jobs API; a minimal DAG sketch follows this list.
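
As an illustration of the last step, here is a minimal sketch of a DAG that uses Airflow's SimpleHttpOperator to start a CML Job over HTTP. The connection id (cml_http), the endpoint path, and the job identifier below are placeholders rather than values from this article; confirm the exact Jobs API route and authentication scheme against the CML API documentation for your release. Save the file under $AIRFLOW_HOME/dags so the scheduler picks it up:

    cml_job_dag.py:
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2020, 1, 1),
    }

    with DAG("trigger_cml_job",
             default_args=default_args,
             schedule_interval="@daily",
             catchup=False) as dag:

        # "cml_http" is an assumed Airflow Connection (Admin > Connections)
        # whose host points at the CML domain; a CML API key can be stored as
        # the connection login, which the HTTP hook sends as basic auth.
        start_job = SimpleHttpOperator(
            task_id="start_cml_job",
            http_conn_id="cml_http",
            # placeholder path: substitute your own user, project, and job id
            endpoint="api/v1/projects/<username>/<project-name>/jobs/<job-id>/start",
            method="POST",
            headers={"Content-Type": "application/json"},
            data="{}",
        )

Chains of Jobs or Models can then be expressed by adding further operators and declaring dependencies between the tasks.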

This approach hosts Airflow within a CML container. The deployment is isolated to the project and will continue to run and consume resources associated with it. This provides a good prototyping environment for building out more complete end-to-end data engineering pipelines that combine scheduling and flow control with Python, shell, Scala, and Spark processes.

 

We would be interested to hear from you if you apply this to a project.
