In this module, we focus on building features for online serving, and keeping them fresh with a combination of batch feature materialization and stream feature ingestion. We'll be roughly working towards the following:
- Data sources: Kafka + File source
- Online store: Redis
- Use case: Predicting churn for drivers in real time.
- Workshop
- Conclusion
- FAQ
First, we install Feast with Spark and Redis support:
```bash
pip install "feast[spark,redis]"
```

We've changed the original `driver_stats.parquet` to include some new fields and aggregations. You can follow along in `explore_data.ipynb`:
```python
import pandas as pd

pd.read_parquet("feature_repo/data/driver_stats.parquet")
```

The key thing to note is that there are now a `miles_driven` field and a `daily_miles_driven` field (which is a pre-computed aggregation).
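To make the aggregation concrete, here is a rough pandas sketch of how a field like `daily_miles_driven` could be pre-computed from hourly rows. This is illustrative only (not the actual pipeline that produced the parquet file), and the sample values are made up:

```python
import pandas as pd

# Hypothetical hourly driver stats, mimicking the schema described above.
hourly = pd.DataFrame(
    {
        "driver_id": [1001, 1001, 1002],
        "event_timestamp": pd.to_datetime(
            ["2022-05-01 01:00", "2022-05-01 02:00", "2022-05-01 01:00"]
        ),
        "miles_driven": [10.0, 20.0, 5.0],
    }
)

# Sum hourly miles into a per-driver, per-day aggregate.
daily = (
    hourly.groupby(["driver_id", hourly["event_timestamp"].dt.date])["miles_driven"]
    .sum()
    .rename("daily_miles_driven")
    .reset_index()
)
print(daily)
```

Pre-computing the aggregate like this means the online store only needs to serve a single fresh value per driver per day, rather than computing windows at request time.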
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  connection_string: localhost:6379
offline_store:
  type: file
```

The key thing to note for now is that the online store has been configured to be Redis. This is specifically for a single Redis node. If you want to use a Redis cluster, you'd change this to something like:
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  redis_type: redis_cluster
  connection_string: "redis1:6379,redis2:6379,ssl=true,password=my_password"
offline_store:
  type: file
```

Because Feast uses redis-py under the hood, it also works well with hosted Redis instances like AWS ElastiCache (docs).
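To make the cluster connection string format concrete, here is a small pure-Python sketch (illustrative only, not Feast's actual parser) showing how the comma-separated string decomposes into `host:port` nodes and keyword options like `ssl` and `password`:

```python
def parse_redis_connection_string(conn: str):
    """Split a Feast-style Redis connection string into nodes and options.

    Illustrative helper, not Feast's real parsing code: comma-separated items
    are either host:port pairs or key=value options (e.g. ssl, password).
    """
    nodes, options = [], {}
    for item in conn.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            options[key] = value
        else:
            host, port = item.split(":")
            nodes.append((host, int(port)))
    return nodes, options

nodes, options = parse_redis_connection_string(
    "redis1:6379,redis2:6379,ssl=true,password=my_password"
)
print(nodes)    # [('redis1', 6379), ('redis2', 6379)]
print(options)  # {'ssl': 'true', 'password': 'my_password'}
```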
We then use Docker Compose to spin up the services we need.
- This leverages a script (in `kafka_demo/`) that creates a topic, reads from `feature_repo/data/driver_stats.parquet`, generates newer timestamps, and emits the events to the topic.
- This also deploys an instance of Redis.
- This also deploys a Feast push server (on port 6567) + a Feast feature server (on port 6566).
- These servers embed a `feature_store.yaml` file that enables them to connect to a remote registry. The Dockerfile mostly delegates to calling the `feast serve` CLI command, which instantiates a Feast Python server (docs):

```dockerfile
FROM python:3.7
RUN pip install "feast[redis]"
COPY feature_repo/feature_store.yaml feature_store.yaml

# Needed to reach online store within Docker network.
RUN sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml

ENV FEAST_USAGE=False
CMD ["feast", "serve", "-h", "0.0.0.0"]
```
Start up the Docker daemon and then use Docker Compose to spin up the services as described above:
- You may need to run `sudo docker-compose up` if you run into a Docker permission denied error.
```console
$ docker-compose up
Creating network "module_1_default" with the default driver
Creating zookeeper ... done
Creating redis ... done
Creating broker ... done
Creating feast_feature_server ... done
Creating feast_push_server ... done
Creating kafka_events ... done
Attaching to zookeeper, redis, broker, feast_push_server, feast_feature_server, kafka_events
...
```

Relying on streaming features in Feast enables data scientists to increase the freshness of the features they rely on, decreasing training / serving skew.
A data scientist may start out their feature engineering in their notebook by directly reading from the batch source (e.g. a table they join in a data warehouse).
But when they hand this over to an engineer to productionize, they may find that model performance differs because pipeline delays lead to stale data being served.
With Feast, at training data generation time, the data scientist can directly depend on a FeatureView with a PushSource, which ensures consistent access to fresh data at serving time, thus resulting in less training / serving skew.
Let's take a look at an example FeatureView in this repo that uses a PushSource:
```python
from datetime import timedelta

from feast import Field, FeatureView, FileSource, PushSource
from feast.types import Float32

driver_stats = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    description="A table describing the stats of a driver based on hourly logs",
    owner="test2@gmail.com",
)

# A push source is useful if you have upstream systems that transform features
# (e.g. stream processing jobs).
driver_stats_push_source = PushSource(
    name="driver_stats_push_source",
    batch_source=driver_stats,
)

driver_daily_features_view = FeatureView(
    name="driver_daily_features",
    entities=["driver"],
    ttl=timedelta(seconds=8640000000),
    schema=[Field(name="daily_miles_driven", dtype=Float32)],
    online=True,
    source=driver_stats_push_source,
    tags={"production": "True"},
    owner="test2@gmail.com",
)
```

Using a `PushSource` alleviates this. By depending on the `driver_daily_features` feature view, the data scientist knows that at serving time the model will have access to values that are as fresh as possible. Engineers just need to make sure that any registered `PushSource`s have feature values regularly pushed, so the features stay consistently available.
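Once this feature view is registered, pushing a freshly computed value into the online store via the Python SDK might look like the following sketch. It assumes you run it from the feature repo directory after `feast apply`, with Redis up; the sample row values are made up:

```python
from datetime import datetime

import pandas as pd

# A single freshly aggregated row, e.g. emitted by a stream processing job.
# Columns mirror the driver stats schema described above.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime.utcnow()],
        "created": [datetime.utcnow()],
        "daily_miles_driven": [250.0],
    }
)

RUN_PUSH = False  # Set True when a Feast repo + Redis are actually available.
if RUN_PUSH:
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # Reads feature_store.yaml here.
    # Push to the PushSource by name; Feast writes to the online store for
    # every feature view that consumes this push source.
    store.push("driver_stats_push_source", event_df)
```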
In the future, Feast will support the concept of a StreamFeatureView as well, which simplifies the life for the engineer further. This will directly ingest from a streaming source (e.g. Kafka) and apply transformations so an engineer doesn't need to scan for PushSources and push data into Feast.
We'll switch gears into a Jupyter notebook. This will guide you through:
- Registering a `FeatureView` that has a single schema across both a batch source (`FileSource`) with aggregate features and a stream source (`PushSource`).
  - Note: Feast will, in the future, also support directly authoring a `StreamFeatureView` that contains stream transformations / aggregations (e.g. via Spark, Flink, or Bytewax).
- Materializing feature view values from batch sources to the online store (e.g. Redis).
- Ingesting feature view values from streaming sources (e.g. window aggregate features from Spark + Kafka).
- Retrieving features at low latency from Redis through Feast.
- Working with a Feast push server + feature server to ingest and retrieve features through HTTP endpoints (instead of needing `feature_store.yaml` and `FeatureStore` instances).
Run the Jupyter notebook (feature_repo/workshop.ipynb).
To ensure fresh features, you'll want to schedule materialization jobs regularly. This can be as simple as a cron job that calls `feast materialize-incremental`.
Users may also be interested in integrating with Airflow, in which case you can build a custom Airflow image with the Feast SDK installed, and then use a `BashOperator` (with `feast materialize-incremental`) or `PythonOperator` (with `store.materialize_incremental(datetime.datetime.now())`):
```python
import datetime

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
    repo_config = RepoConfig(
        registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
        project="feast_demo_aws",
        provider="aws",
        offline_store="file",
        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
    )
    store = FeatureStore(config=repo_config)
    store.materialize_incremental(datetime.datetime.now())

# Use PythonOperator
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)

# Use BashOperator
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)
```

See also the FAQ: How do I speed up or scale up materialization?
The above notebook introduces a way to curl an HTTP endpoint to push or retrieve features from Redis.
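As a sketch of what those HTTP requests contain, the snippet below builds the JSON bodies for the push server's `/push` endpoint and the feature server's `/get-online-features` endpoint. The exact payload shapes should be confirmed against the notebook and the Feast feature server docs; the values and the commented-out `requests` calls are illustrative:

```python
import json

# Payload for the push server's /push endpoint (port 6567):
# a push source name plus column-oriented feature rows.
push_payload = {
    "push_source_name": "driver_stats_push_source",
    "df": {
        "driver_id": [1001],
        "event_timestamp": ["2022-05-13 10:59:42"],
        "created": ["2022-05-13 10:59:42"],
        "daily_miles_driven": [250.0],
    },
}

# Payload for the feature server's /get-online-features endpoint (port 6566):
# fully qualified feature references plus entity keys to look up.
get_payload = {
    "features": ["driver_daily_features:daily_miles_driven"],
    "entities": {"driver_id": [1001]},
}

print(json.dumps(push_payload))
print(json.dumps(get_payload))

# With the servers running, you could send these with e.g.:
#   requests.post("http://localhost:6567/push", data=json.dumps(push_payload))
#   requests.post("http://localhost:6566/get-online-features", data=json.dumps(get_payload))
```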
The servers by default cache the registry (expiring and reloading every 10 minutes). If you want to customize that time period, you can do so in feature_store.yaml.
Let's look at the feature_store.yaml used in this module (which configures the registry differently than in the previous module):
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  connection_string: localhost:6379
offline_store:
  type: file
```

The registry config maps to constructor arguments for the `RegistryConfig` Pydantic model (reference).

- In the `feature_store.yaml` above, note that there is a `cache_ttl_seconds` of 5. This ensures that every five seconds, the feature server and push server will expire their registry cache. On the following request, they will refresh the registry by pulling from the registry path.
- Feast adds a convenience wrapper, so if you specify just `registry: [path]`, Feast will map that to `RegistryConfig(path=[your path])`.
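To make the convenience wrapper concrete, here is a sketch of the two equivalent forms in `feature_store.yaml` (the expanded form is what you'd use to also customize the cache TTL):

```yaml
# Shorthand: a bare path...
registry: data/local_registry.db

# ...is equivalent to the expanded form, which also lets you set the cache TTL:
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
```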
By the end of this module, you will have learned how to build streaming features that power real-time models with Feast. Feast abstracts away the need to think about data modeling in the online store and helps you:
- maintain fresh features in the online store by
  - ingesting batch features into the online store (via `feast materialize` or `feast materialize-incremental`)
  - ingesting streaming features into the online store (e.g. through `feature_store.push` or a push server endpoint (`/push`))
- serve features (e.g. through `feature_store.get_online_features` or through feature servers)
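For the serving side, a minimal sketch of an online lookup via the Python SDK is shown below. It assumes a feature repo with the `driver_daily_features` view applied and Redis running, so the actual lookup is gated behind a flag:

```python
# Fully qualified feature references: "<feature view name>:<feature name>".
feature_refs = ["driver_daily_features:daily_miles_driven"]

RUN_LOOKUP = False  # Set True when the feature repo + Redis are actually available.
if RUN_LOOKUP:
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # Reads feature_store.yaml here.
    features = store.get_online_features(
        features=feature_refs,
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()
    print(features)
```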
Can feature / push servers refresh their registry in response to an event? e.g. after a PR merges and feast apply is run?
Unfortunately, the servers don't currently support this. Feel free to contribute a PR to enable it! The tricky part is that Feast would need to keep track of these servers in the registry (or in some other way), which is not how Feast is currently designed.
Materialization in Feast by default pulls the latest feature values for each unique entity locally and writes in batches to the online store.
- Feast users can materialize multiple feature views by using the CLI: `feast materialize-incremental [FEATURE_VIEW_NAME]`
  - Caveat: By default, Feast's registry store is a single protobuf written to a file. This means there's a chance that metadata around materialization intervals gets lost if the registry changes during materialization.
  - The community is ideating on how to improve this. See RFC-035: Scalable Materialization.
- Users often also implement their own custom providers. The provider interface has a `materialize_single_feature_view` method, which users are free to implement differently (e.g. materializing with Spark or Dataflow jobs).
In general, the community is actively investigating ways to speed up materialization. Contributions are welcome!

