Problem Overview

As organizations mature in implementing AI systems, they look to standardize workflows across different machine learning use cases. One key challenge in the ML lifecycle is that ML teams spend considerable development time on data exploration and data analysis. These tasks are crucial because the type and quality of the datasets used as inputs for model training directly affect both how much time is spent training the model and how well it performs afterward. Before being used for model training, dataset attributes known as features go through many changes. These transformations can be simple, e.g., converting a categorical variable such as “opt for email subscriptions” from yes/no to the boolean values 1 and 0. Or they can be complex, such as merging multiple input fields into a new field, e.g., combining a song's music genre, decade of origin, country, and band size into a unique “musicality” feature.
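
To make this concrete, here is a minimal sketch of both kinds of transformations using pandas. The column names and the string-based musicality encoding are illustrative assumptions, not taken from the demo dataset used later in this article:

import pandas as pd

# Illustrative input data (hypothetical columns)
songs = pd.DataFrame({
    "email_opt_in": ["yes", "no", "yes"],
    "genre": ["rock", "jazz", "rock"],
    "decade": [1970, 1990, 1980],
    "country": ["US", "FR", "UK"],
    "band_size": [4, 3, 5],
})

# Simple transformation: categorical yes/no to boolean 1/0
songs["email_opt_in_flag"] = (songs["email_opt_in"] == "yes").astype(int)

# Complex transformation: merge multiple fields into one "musicality" feature
songs["musicality"] = (
    songs["genre"]
    + "_" + songs["decade"].astype(str)
    + "_" + songs["country"]
    + "_" + songs["band_size"].astype(str)
)
print(songs[["email_opt_in_flag", "musicality"]])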

Since the feature development process is demanding, ML teams want to share common features and establish common internal standards for how new features in a dataset are created or updated. Discovering available features and reusing them where possible can save an ML engineer significant time before deciding to build one from scratch. Feature stores address this need by:

  • Making features consistently available for training and serving
  • Decoupling feature access from the underlying data infrastructure

The decoupling aspect is important. Datasets used for developing features in an ML lifecycle may be stored in multiple data stores in different formats, each requiring a different access mechanism. ML engineers need a way to consistently discover and access the features they need to build solutions for their business use cases without worrying about the infrastructure that stores the data or the data access mechanisms.

In this article, we discuss how to set up a feature store such as Feast so developers can take advantage of feature consistency and low-latency feature availability. We will integrate Feast with Cloudera Machine Learning on the Cloudera Data Platform (CDP).

Architecture

Since Cloudera Machine Learning is a highly extensible ML platform, integrating a feature store is quite easy. The picture below shows an example of integrating a feature store such as Feast on the Cloudera Data Platform.

[Architecture diagram: Feast integration on the Cloudera Data Platform]

Some important aspects of this implementation are provided below:

| Component | Name | Description |
|---|---|---|
| Hosting Platform | Cloudera Machine Learning | The feature store is set up in a containerized instance of Cloudera Machine Learning. This allows the store to scale up and down as needed. |
| Feast Offline Store | Spark on Hive | We use Spark on Hive as the offline store. |
| Online Store | SQLite | A low-latency database, also hosted on the Cloudera Machine Learning platform for demo purposes. In a real-life use case, this can be a separate database (e.g., Postgres); see the Feast documentation for supported online databases. |
| Catalog | Feast Registry / Catalog | Contains the feature services and views that are then used to fetch features. |
| Feast Application | Application on Cloudera Machine Learning | A web-based, read-only front-end application using Streamlit that provides an easy-to-use UI for feature discovery. |

Implementation

Prerequisite Steps

The following prerequisites are needed for setting up Feast on Cloudera Machine Learning:

  • Access to a Cloudera Machine Learning service instance
  • Access to the S3 bucket in the Datalake environment from Cloudera Machine Learning
  • Launch a new Cloudera Machine Learning project and clone the GitHub repo (see Resources and References), either directly at project creation or from a new terminal session
  • Follow the step-by-step instructions in the README.md of the GitHub repo to set up Feast on Cloudera Machine Learning (a quick sanity check follows this list)
  • Run the DataLoad.ipynb Jupyter notebook to load the dataset
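
Once the README steps are complete, a quick way to confirm from a notebook cell or Python session that the Feast SDK is available in the project environment is:

# Sanity check: the Feast SDK should be importable after setup
import feast

print(feast.__version__)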

Setting Up Feast Elements Using the Feast CLI

  • Start a new session in Cloudera Machine Learning with the “Enable Spark” option and an editor such as Jupyter Notebook.
    [Screenshot: new session configuration with Enable Spark selected]
  • Once the session is launched, start a new terminal window and run the following commands in sequence. You may run into some warnings from Feast; we will ignore them for now.

$ cd feast_project/feature_repo
$ feast apply

  • If the setup is complete, you should see the feature service created, as follows:
    [Screenshot: output of feast apply]

Let us take a moment to understand what all this means. This demo Feast setup uses a driver dataset for a ride aggregator. The instructions in the GitHub repo help you set up this dataset for consumption by a Feature store such as Feast.

Feast's messages tell us that it has created an entity called driver, with different feature views and services for driver activity. We can also see that it has set up our online store in a SQLite table.

Understanding Feature Store Configurations

|- feast_project
    |- data : Online feature database 
    |- feature_repo 
        |- feature_store.yaml : Feature store configuration file for offline and online stores
        |- example_repo.py : Feature definition file 
|- DataLoad.ipynb : Validate Spark's ability to fetch the offline store data
|- feast-ux.py : used to load the Feast UI
|- feature-store-dev.ipynb : Interactive notebook for Feast historical and online store access

The folder structure above is created when you clone the GitHub repository listed in Resources and References.

A feature store has to be configured in two places:

  • The feature_store.yaml file defines the locations of the offline and online stores. It needs to include the location of the S3 bucket that Spark will use to load the offline features; the online feature store configuration is also added here. Specifically, you will need to change the following lines to match your Spark configuration:

spark.sql.warehouse.dir: "s3a://your-path-to-hive-folder"
spark.hadoop.fs.s3a.s3guard.ddb.region: "us-east-1"
spark.kerberos.access.hadoopFileSystems: "s3a://your-bucket/"

  • The example_repo.py file contains the Feast definitions for setting up the feature store, including the source, views, and services. While no changes are required here, it is recommended that you review the Feast documentation to understand the configuration details. A sketch of what such a definition file typically contains follows this list.
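
As a rough, illustrative sketch (not a verbatim copy of the repo's example_repo.py), the definitions behind the driver_hourly_stats feature view described later in this article would look approximately like this, assuming Feast's Spark offline store contrib package is installed:

from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.types import Float32, Int64

# Entity: the business object the features describe
driver = Entity(name="driver", join_keys=["driver_id"])

# Batch source: the Hive table that Spark reads as the offline store
driver_stats_source = SparkSource(
    name="driver_hourly_stats_source",
    table="driver_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Feature view: the features served for training and online inference
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=3),  # TTL chosen for this illustration; the demo's actual value may differ
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_stats_source,
    tags={"team": "driver_performance"},
)

Running feast apply picks up these definitions and registers them in the Feast registry.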

Use Case Architecture Diagram

[Diagram: Feature use case architecture]

As mentioned earlier, the architecture above shows the different components of a driver statistics dataset for a ride aggregator that can be used to serve both historical and on-demand features.

Feature Discovery

Feast provides a very intuitive UI in addition to good CLI capabilities to help with feature discovery. Here is an example of using the Feast CLI to get the list of features for a specific feature view, driver_hourly_stats:

cdsw@tpwutuac3d78s486:~/feast_project/feature_repo$ feast feature-views describe driver_hourly_stats 
spec:
  name: driver_hourly_stats
  entities:
  - driver
  features:
  - name: conv_rate
    valueType: FLOAT
  - name: acc_rate
    valueType: FLOAT
  - name: avg_daily_trips
    valueType: INT64
    description: Average daily trips
  tags:
    team: driver_performance
  ttl: 259200000s
  batchSource:
    type: BATCH_SPARK
    timestampField: event_timestamp
    createdTimestampColumn: created
    dataSourceClassType: feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkSource
    name: driver_hourly_stats_source
    sparkOptions:
      table: driver_stats
  online: true
  entityColumns:
  - name: driver_id
    valueType: INT64
meta:
  createdTimestamp: '2024-07-19T05:35:52.471076Z'
  lastUpdatedTimestamp: '2024-07-19T05:35:52.471076Z'
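
The same metadata is also available programmatically through the Feast SDK. A minimal sketch:

# Inspect the registered feature view via the Feast SDK
from feast import FeatureStore

store = FeatureStore(repo_path="./feast_project/feature_repo")
fv = store.get_feature_view("driver_hourly_stats")
print(fv.name, [f.name for f in fv.features])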

Similarly, the Feast UI can also be used for feature discovery. Perform the following steps to launch the Feast UI as an application in Cloudera Machine Learning:

  • Create a new Application in Cloudera Machine Learning using the following configuration parameters:
    • Name: Feast UI
    • File: feast-ux.py
    • Resources: 1 vCPU, 2 GB Memory
  • You should now be able to launch the Feast UI application as shown below:
    [Screenshot: Feast UI application]
  • Use the menu items on the left for feature service discovery, including the available feature views and feature services.

Feature Consumption

Now that we have set up the feature store, we need to use the Feast libraries to access these features. Launch a new session with feature-store-dev.ipynb to understand how to consume offline and online features from the feature store. Below is example code that uses the online store to get some pre-computed features with low latency:

# Fetching Feature vectors for inference
from pprint import pprint
from feast import FeatureStore
store = FeatureStore(repo_path="./feast_project/feature_repo")
feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1001},
        {"driver_id": 1002},
    ],
).to_dict()
pprint(feature_vector)
##OUTPUT
{'acc_rate': [0.011025987565517426, 0.3090273141860962],
 'avg_daily_trips': [711, 44],
 'conv_rate': [0.8127095699310303, 0.13138850033283234],
 'driver_id': [1001, 1002]}
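
The feature-store-dev.ipynb notebook also covers the offline path. As a minimal sketch of fetching point-in-time-correct historical features for training (the driver IDs and timestamps below are illustrative, not from the demo data):

# Fetching historical features for training (point-in-time correct joins)
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="./feast_project/feature_repo")

# Entity dataframe: which drivers, as of which event timestamps
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [
            datetime(2024, 7, 18, 12, 0),
            datetime(2024, 7, 18, 12, 0),
        ],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
print(training_df.head())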

Resources and References
