Problem Overview
As organizations mature in implementing AI systems, they look to standardize workflows across different machine learning use cases. One key challenge in the ML lifecycle is that ML teams spend considerable time on data exploration and data analysis activities. These tasks are crucial because the type and quality of the datasets used as inputs for model training directly affect how long training takes and how well the resulting model performs. Before being used for model training, attributes in a dataset, known as features, go through many transformations. These transformations can be simple, e.g., changing a categorical variable such as “opt for email subscriptions” from yes and no to the boolean values 1 and 0. Or they can be complex, such as merging multiple input fields into a new field, e.g., labeling a song with a combination of attributes such as music genre, decade of origin, country, and band size to come up with a unique “musicality” feature.
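As a concrete illustration, the simple encoding described above might look as follows in pandas. The column names here are hypothetical and purely illustrative:

import pandas as pd

# Hypothetical raw customer data with a categorical subscription flag
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email_subscription": ["yes", "no", "yes"],
})

# Simple feature transformation: map the yes/no values to 1/0
customers["email_subscription_flag"] = customers["email_subscription"].map({"yes": 1, "no": 0})

print(customers)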
Since the feature development process is demanding, ML teams want to share common features and establish internal standards for how new features in a dataset are created or updated. Being able to discover and reuse existing features saves an ML engineer significant time before deciding to build one from scratch. Feature stores address this need by:
- Making features consistently available for training and serving
- Decoupling the feature access from the underlying data infrastructure.
The decoupling aspect mentioned above is important. Datasets used for developing features in an ML lifecycle could be stored in multiple data stores and in different formats, each requiring different access mechanisms. ML engineers need a consistent way to discover and access the features they need for their business use cases without having to worry about the infrastructure that stores the data or the data access mechanisms.
In this article, we discuss how to set up a feature store such as Feast so that developers can take advantage of feature consistency and low-latency feature availability. We will integrate Feast with Cloudera Machine Learning (CML) on the Cloudera Data Platform (CDP).
Architecture
Since Cloudera Machine Learning is a highly extensible ML platform, integrating a feature store is straightforward. The picture below shows an example of integrating a feature store such as Feast on the Cloudera Data Platform.
Some important aspects of this implementation are provided below:
| Component | Name | Description |
| --- | --- | --- |
| Hosting Platform | Cloudera Machine Learning | The feature store is set up in a containerized instance of Cloudera Machine Learning. This allows the store to scale up and down as needed. |
| Feast Offline Store | Spark on Hive | We use Spark on Hive as the offline store. |
| Online Store | SQLite | A low-latency database that is also hosted on the Cloudera Machine Learning platform for demo purposes. In a real-life use case, this can be a separate database (e.g., Postgres). See the Feast documentation for supported online databases. |
| Catalog | Feast Registry / Catalog | Contains the feature services and views that are then used to fetch features. |
| Feast Application | Application on Cloudera Machine Learning | A web-based, read-only front-end application using Streamlit that provides an easy-to-use UI for feature discovery. |
Implementation:
Prerequisite Steps:
The following prerequisites are needed for setting up Feast on CML:
- Access to a Cloudera Machine Learning service instance
- Access from Cloudera Machine Learning to the S3 bucket in the Data Lake environment
- Launch a new Cloudera Machine Learning project and clone the GitHub repo listed under Resources, either directly at project creation or from a new terminal session.
- Follow the step-by-step instructions in the README.md of the GitHub repo to set up Feast on Cloudera Machine Learning.
- Run the DataLoad.ipynb Jupyter notebook to load the dataset (a conceptual sketch of this step follows this list).
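Conceptually, the data-load step registers the demo driver statistics data as a Hive table that the Spark offline store can query later. The snippet below is only an illustrative sketch; the parquet path and table name are assumptions, and the actual notebook in the repo is the source of truth.

# Illustrative sketch of the data-load step (not the actual notebook code)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("feast-data-load")
    .enableHiveSupport()  # so the table lands in the Hive warehouse on S3
    .getOrCreate()
)

# Read the demo driver statistics dataset and register it as a Hive table
driver_stats = spark.read.parquet("data/driver_stats.parquet")  # assumed path
driver_stats.write.mode("overwrite").saveAsTable("driver_stats")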
Setting up Feast Elements Using the Feast CLI:
- Start a new Session in Cloudera Machine Learning with the “Enable Spark” option and an editor like Jupyter Notebook.
- Once the session is launched, open a new terminal window and run the following commands in sequence. You may see some warnings from Feast; you can ignore them for now.
$ cd feast_project/feature_repo
$ feast apply
- Once the command completes, you should see output confirming that the feature entities, views, and services have been set up, as follows:
Let us take a moment to understand what this means. This demo Feast setup uses a driver dataset for a ride aggregator. The instructions in the GitHub repo help you set up this dataset for consumption by a feature store such as Feast.
The Feast messages tell us that it has created an entity called driver, along with feature views and feature services for driver activity. We can also see that it has set up our online store as a SQLite table.
Understanding Feature Store Configurations
|- feast_project
|  |- data : Online feature database
|  |- feature_repo
|     |- feature_store.yaml : Feature store configuration file for the offline and online stores
|     |- example_repo.py : Feature definition file
|- DataLoad.ipynb : Validates Spark's ability to fetch the offline store data
|- feast-ux.py : Used to load the Feast UI
|- feature-store-dev.ipynb : Interactive notebook for Feast historical and online store access
The above folder structure is created when you clone the GitHub repository listed under Resources.
A feature store has to be configured in two places:
- The feature_store.yaml file defines the locations of the offline and online stores. It needs to include the location of the S3 bucket that Spark uses to load the offline features, and the online store configuration is also added here. Specifically, you will need to change the following lines to match your Spark configuration (a representative version of the full file is sketched after this list):
spark.sql.warehouse.dir: "s3a://your-path-to-hive-folder"
spark.hadoop.fs.s3a.s3guard.ddb.region: "us-east-1"
spark.kerberos.access.hadoopFileSystems: "s3a://your-bucket/"
- The example_repo.py file contains the Feast definitions for the feature store, including the data source, feature views, and feature services. No changes are required here, but it is recommended that you review the Feast documentation to understand the configuration details.
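For orientation, a representative feature_store.yaml for this kind of setup (Spark offline store, SQLite online store, local registry) might look roughly like the following. Treat this as a sketch only; the project name, registry path, and SQLite path are assumptions, and the file in the GitHub repo is the authoritative version.

project: feast_project            # assumed project name
registry: data/registry.db        # local Feast registry
provider: local
offline_store:
  type: spark
  spark_conf:
    spark.sql.catalogImplementation: "hive"
    spark.sql.warehouse.dir: "s3a://your-path-to-hive-folder"
    spark.hadoop.fs.s3a.s3guard.ddb.region: "us-east-1"
    spark.kerberos.access.hadoopFileSystems: "s3a://your-bucket/"
online_store:
  type: sqlite
  path: data/online_store.db      # low-latency store used for serving
entity_key_serialization_version: 2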
Use Case Architecture Diagram
Feature Use Case Architecture
As mentioned earlier, the architecture above shows the different components used to serve both historical and on-demand features from the driver statistics dataset of a ride aggregator.
Feature Discovery
Feast provides an intuitive UI in addition to good CLI capabilities to help with feature discovery. Here is an example of using the Feast CLI to list the features for a specific feature view, driver_hourly_stats:
cdsw@tpwutuac3d78s486:~/feast_project/feature_repo$ feast feature-views describe driver_hourly_stats
spec:
  name: driver_hourly_stats
  entities:
  - driver
  features:
  - name: conv_rate
    valueType: FLOAT
  - name: acc_rate
    valueType: FLOAT
  - name: avg_daily_trips
    valueType: INT64
    description: Average daily trips
  tags:
    team: driver_performance
  ttl: 259200000s
  batchSource:
    type: BATCH_SPARK
    timestampField: event_timestamp
    createdTimestampColumn: created
    dataSourceClassType: feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkSource
    name: driver_hourly_stats_source
    sparkOptions:
      table: driver_stats
  online: true
  entityColumns:
  - name: driver_id
    valueType: INT64
meta:
  createdTimestamp: '2024-07-19T05:35:52.471076Z'
  lastUpdatedTimestamp: '2024-07-19T05:35:52.471076Z'
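The feature view described above corresponds to a definition in example_repo.py roughly along the following lines. This is a sketch reconstructed from the describe output, not the exact contents of the repo; in particular, the TTL value and variable names may differ.

# Sketch of a Feast feature view definition matching the describe output above
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Entity with driver_id as the join key
driver = Entity(name="driver", join_keys=["driver_id"])

# Offline (batch) source backed by the driver_stats Hive table via Spark
driver_stats_source = SparkSource(
    name="driver_hourly_stats_source",
    table="driver_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Feature view exposing the three driver statistics features
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=3),  # illustrative; the actual TTL may differ
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_stats_source,
    tags={"team": "driver_performance"},
)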
Similarly, the Feast UI can be used for feature discovery. Perform the following steps to launch the Feast UI as an application in Cloudera Machine Learning:
- Create a New Application in Cloudera Machine Learning using the following configuration parameters:
- Name: Feast UI
- File: feast-ux.py
- Resources: 1 vCPU, 2 GB Memory
- You should now be able to launch the Feast UI application, as shown below:
- Use the menu items on the left for feature discovery, including the available feature views and feature services.
Feature Consumption
Now that we have set up the feature store, we need to use the Feast libraries to access these features. Launch a new session with feature-store-dev.ipynb to understand how to consume offline and online features from the feature store. Below is example code that uses the online store to fetch pre-computed features with low latency (a corresponding sketch for historical, offline features follows the output):
# Fetching feature vectors for inference
from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path="./feast_project/feature_repo")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1001},
        {"driver_id": 1002},
    ],
).to_dict()

pprint(feature_vector)
## OUTPUT
{'acc_rate': [0.011025987565517426, 0.3090273141860962],
 'avg_daily_trips': [711, 44],
 'conv_rate': [0.8127095699310303, 0.13138850033283234],
 'driver_id': [1001, 1002]}
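The notebook also covers the offline path. Fetching historical features for training looks roughly like the sketch below; the entity DataFrame, with hypothetical driver IDs and event timestamps, is purely illustrative.

# Fetching historical (offline) features for training via the Spark offline store
from datetime import datetime
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="./feast_project/feature_repo")

# Illustrative entity rows: the timestamps drive the point-in-time joins
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [
            datetime(2024, 7, 15, 10, 0, 0),
            datetime(2024, 7, 15, 11, 0, 0),
        ],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print(training_df.head())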
Resources and References
- GitHub Repo: Feast on CML: the repository used to set up the feature store use case described above
- https://feast.dev/: Feast documentation, covering the API and how Feast works