Apache Airflow is a popular platform for scheduling and monitoring workflows. In machine learning, scheduled workflows are useful for automating data acquisition and for building and monitoring models.

 

This tutorial will take you through the process of setting up Airflow as an Application in CML. This can be very useful for building out test and prototype pipelines. It also provides a mechanism for calling chains of Models or Jobs.

 

  1. Create a new Python project in CML using a blank template (see the New Project dialog).
  2. Open up a Workbench session and install Airflow; this can be scripted using the install instructions here.
  3. The following Python and shell scripts can be used to automate this process; the installation needs to be performed only once:
    install.py:
    import os

    # run the one-time Airflow installation script
    os.system("./install.sh")

    install.sh:
    #!/bin/bash -x
    
    export AIRFLOW_HOME=~/airflow
    
    # install from pypi using pip
    pip3 install apache-airflow
    
    # initialize the database
    airflow initdb
  4. This installs the Airflow components into your project's file system, where they persist for the lifetime of the project.
  5. Now that the software is installed, the next step is to start the processes as a long-lived application. This requires a Python file and a shell script to start the services.
    start.py:
    import os

    # launch the Airflow web server and scheduler
    os.system("./start.sh")

    start.sh:
    #!/bin/bash -x

    # use the port CML assigns to the Application, falling back to 8090
    PORT=${CDSW_APP_PORT:-8090}

    export AIRFLOW_HOME=~/airflow

    # start the web server in the background so the scheduler can also start
    airflow webserver -p $PORT -hn 127.0.0.1 &

    # start the scheduler (kept in the foreground so the Application stays alive)
    airflow scheduler

    # open the Application in CML to reach the Airflow UI and enable the example DAGs on the home page
  6. The PORT variable is read from the CDSW_APP_PORT environment variable that CML sets for the Application, and the web server is started on localhost. These scripts can be tested in a Workbench session, or added to an Application with the following settings (see the Create Application dialog).
  7. Validate that the application starts up cleanly by checking the application's logs (see Monitor Logs).
  8. Opening the application loads directly into the Airflow management UI (see Access Airflow UX).
  9. Using the Airflow HTTP Operator, it is possible to call CML Jobs directly via the CML Jobs API; a minimal DAG sketch follows this list.
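
As an illustration of the last step, here is a minimal sketch of a DAG that uses Airflow's SimpleHttpOperator to start a CML Job over HTTP. The connection id (cml_http), the endpoint path, and the job identifier below are placeholders rather than values from this article; confirm the exact Jobs API route and authentication scheme against the CML API documentation for your release. Save the file under $AIRFLOW_HOME/dags so the scheduler picks it up:

    cml_job_dag.py:
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2020, 1, 1),
    }

    with DAG("trigger_cml_job",
             default_args=default_args,
             schedule_interval="@daily",
             catchup=False) as dag:

        # "cml_http" is an assumed Airflow Connection (Admin > Connections)
        # whose host points at the CML domain; a CML API key can be stored as
        # the connection login, which the HTTP hook sends as basic auth.
        start_job = SimpleHttpOperator(
            task_id="start_cml_job",
            http_conn_id="cml_http",
            # placeholder path: substitute your own user, project, and job id
            endpoint="api/v1/projects/<username>/<project-name>/jobs/<job-id>/start",
            method="POST",
            headers={"Content-Type": "application/json"},
            data="{}",
        )

Chains of Jobs or Models can then be expressed by adding further operators and declaring dependencies between the tasks.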

This approach hosts Airflow within a CML container. The deployment is isolated to the project and will continue to run and consume resources associated with it. This provides a good prototyping environment for building out more complete end-to-end data engineering pipelines that combine scheduling and flow control with Python, shell, Scala, and Spark processes.

 

We would be interested to hear from you if you apply this to a project.
