Created on 03-24-2020 03:07 AM - edited on 03-31-2020 12:40 AM by VidyaSargur
Apache Airflow is a popular environment for scheduling and monitoring workflows. In machine learning, scheduled workflows are useful for automating data acquisition and for building and monitoring models.
This tutorial will take you through the process of setting up Airflow as an Application in CML. This can be very useful for building out test and prototype pipelines. It also provides a mechanism for calling chains of CML Models or Jobs.
Create a new Python project in CML using a blank template: New Project
Open up a Workspace and install Airflow; this can be scripted using the install instructions here.
The following script can be used to automate this process. The installation needs to be performed only once. install.py:
# install from PyPI using pip
pip3 install apache-airflow
# initialize the Airflow metadata database
airflow initdb
This will install the Airflow components into your project, and they will persist with the lifecycle of the project.
Now that the software is installed, the next step is to start the processes as a long-lived Application. This requires a Python file and shell script to start up the services:
# start the web server on the Application port, bound to localhost
airflow webserver -p $PORT -hn 127.0.0.1 &
# start the scheduler in the foreground to keep the Application alive
airflow scheduler
# open the Airflow home page in the browser and enable the example DAGs
The PORT variable is set from the Application's environment, and the server is bound to localhost. The script can be tested in a Workbench session, or it can be added to an Application with the following settings: Create Application
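As a minimal sketch of the Python launcher side, the startup command can be assembled from the environment before handing off to the shell. This assumes CML exposes the Application's port via the CDSW_APP_PORT environment variable; the function name and fallback port are illustrative, not part of the CML API.

```python
# Hypothetical launcher sketch: CDSW_APP_PORT and the command layout are
# assumptions -- adapt them to your CML project and Airflow version.
import os


def build_start_command(port=None):
    """Build the shell command that launches the Airflow services."""
    # fall back to Airflow's default port when run outside an Application
    port = port or os.environ.get("CDSW_APP_PORT", "8080")
    # the webserver is backgrounded; the scheduler keeps the process alive
    return (f"airflow webserver -p {port} -hn 127.0.0.1 & "
            "exec airflow scheduler")


print(build_start_command())
```

In an Application, the returned string would be passed to the shell (for example via os.system) so that both services stay up for the lifetime of the Application.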
Validate that the application starts up cleanly by accessing the Application's logs: Monitor Logs
Opening the application will load directly into the Airflow management UI: Access Airflow UX
Using Airflow's HTTP operator (SimpleHttpOperator), it is possible to call CML Jobs directly via the CML Jobs API.
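Inside a DAG, that call would be wrapped in the HTTP operator; as a framework-free illustration, the request itself can be sketched with the standard library. The /api/v1 route, the identifiers, and the Authorization header below are assumptions for illustration; consult your CML deployment's API documentation for the exact path and authentication scheme.

```python
# Hypothetical sketch of a CML Job start request; route and auth are
# assumptions, not the documented CML API.
import json
import urllib.request


def build_job_start_request(host, project, job_id, api_key):
    """Build (but do not send) the POST request that starts a CML Job."""
    url = f"{host}/api/v1/projects/{project}/jobs/{job_id}/start"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # placeholder auth header; CML typically authenticates with an API key
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_job_start_request(
    "https://ml.example.com", "someuser/someproject", "42", "not-a-real-key")
print(req.get_method(), req.full_url)
```

An Airflow task would issue the same POST on a schedule, allowing a DAG to chain several CML Jobs together.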
This deployment hosts Airflow within a CML container. It is isolated to the Project and will continue to run and consume resources associated with the project. This provides a good prototyping environment for building out more complete end-to-end data engineering pipelines that combine scheduling and flow control with Python, shell, Scala, and Spark processes.
We would be interested to hear from you if you apply this to a project.