In this module, we focus on building features for online serving, and keeping them fresh with a combination of batch feature materialization and stream feature ingestion. We'll be roughly working towards the following:
- Data sources: Kafka + File source
- Online store: Redis
- Use case: Predicting churn for drivers in real time.
- Workshop
- Conclusion
- FAQ
First, we install Feast with Spark and Redis support:
```bash
pip install "feast[spark,redis]"
```

We've changed the original `driver_stats.parquet` to include some new fields and aggregations. You can follow along in `explore_data.ipynb`:
```python
import pandas as pd

pd.read_parquet("feature_repo/data/driver_stats.parquet")
```

The key thing to note is that there are now a `miles_driven` field and a `daily_miles_driven` field (which is a pre-computed aggregation).
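To make the aggregation concrete, here is a rough pandas sketch of how a field like `daily_miles_driven` could be pre-computed from hourly rows. This is illustrative only (not the actual pipeline that produced the parquet file), and the sample values are made up:

```python
import pandas as pd

# Hypothetical hourly driver stats, mimicking the schema described above.
hourly = pd.DataFrame(
    {
        "driver_id": [1001, 1001, 1002],
        "event_timestamp": pd.to_datetime(
            ["2022-05-01 01:00", "2022-05-01 02:00", "2022-05-01 01:00"]
        ),
        "miles_driven": [10.0, 20.0, 5.0],
    }
)

# Sum hourly miles into a per-driver, per-day aggregate.
daily = (
    hourly.groupby(["driver_id", hourly["event_timestamp"].dt.date])["miles_driven"]
    .sum()
    .rename("daily_miles_driven")
    .reset_index()
)
print(daily)
```

Pre-computing the aggregate like this means the online store only needs to serve a single fresh value per driver per day, rather than computing windows at request time.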
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  connection_string: localhost:6379
offline_store:
  type: file
```

The key thing to note for now is that the online store has been configured to be Redis. This is specifically for a single Redis node. If you want to use a Redis cluster, you'd change this to something like:
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  redis_type: redis_cluster
  connection_string: "redis1:6379,redis2:6379,ssl=true,password=my_password"
offline_store:
  type: file
```

Because Feast uses redis-py under the hood, it also works well with hosted Redis instances like AWS ElastiCache (docs).
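To make the cluster connection string format concrete, here is a small pure-Python sketch (illustrative only, not Feast's actual parser) showing how the comma-separated string decomposes into `host:port` nodes and keyword options like `ssl` and `password`:

```python
def parse_redis_connection_string(conn: str):
    """Split a Feast-style Redis connection string into nodes and options.

    Illustrative helper, not Feast's real parsing code: comma-separated items
    are either host:port pairs or key=value options (e.g. ssl, password).
    """
    nodes, options = [], {}
    for item in conn.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            options[key] = value
        else:
            host, port = item.split(":")
            nodes.append((host, int(port)))
    return nodes, options

nodes, options = parse_redis_connection_string(
    "redis1:6379,redis2:6379,ssl=true,password=my_password"
)
print(nodes)    # [('redis1', 6379), ('redis2', 6379)]
print(options)  # {'ssl': 'true', 'password': 'my_password'}
```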
We then use Docker Compose to spin up the services we need.
- This leverages a script (in `kafka_demo/`) that creates a topic, reads from `feature_repo/data/driver_stats.parquet`, generates newer timestamps, and emits the events to the topic.
- This also deploys an instance of Redis.
- This also deploys a Feast push server (on port 6567) + a Feast feature server (on port 6566).
- These servers embed a `feature_store.yaml` file that enables them to connect to a remote registry. The Dockerfile mostly delegates to calling the `feast serve` CLI command, which instantiates a Feast Python server (docs):

```dockerfile
FROM python:3.7
RUN pip install "feast[redis]"
COPY feature_repo/feature_store.yaml feature_store.yaml

# Needed to reach online store within Docker network.
RUN sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml

ENV FEAST_USAGE=False
CMD ["feast", "serve", "-h", "0.0.0.0"]
```
Start up the Docker daemon and then use Docker Compose to spin up the services as described above:
- You may need to run `sudo docker-compose up` if you run into a Docker permission denied error.
```console
$ docker-compose up
Creating network "module_1_default" with the default driver
Creating zookeeper ... done
Creating redis ... done
Creating broker ... done
Creating feast_feature_server ... done
Creating feast_push_server ... done
Creating kafka_events ... done
Attaching to zookeeper, redis, broker, feast_push_server, feast_feature_server, kafka_events
...
```

Relying on streaming features in Feast enables data scientists to increase the freshness of the features they rely on, decreasing training / serving skew.
A data scientist may start out their feature engineering in their notebook by directly reading from the batch source (e.g. a table they join in a data warehouse).
But when they hand this over to an engineer to productionize, they may find that model performance differs because pipeline delays lead to stale data being served.
With Feast, at training data generation time, the data scientist can directly depend on a FeatureView with a PushSource, which ensures consistent access to fresh data at serving time, thus resulting in less training / serving skew.
Let's take a look at an example FeatureView in this repo that uses a PushSource:
```python
from datetime import timedelta

from feast import Field, FeatureView, FileSource, PushSource
from feast.types import Float32

driver_stats = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    description="A table describing the stats of a driver based on hourly logs",
    owner="test2@gmail.com",
)

# A push source is useful if you have upstream systems that transform features
# (e.g. stream processing jobs).
driver_stats_push_source = PushSource(
    name="driver_stats_push_source",
    batch_source=driver_stats,
)

driver_daily_features_view = FeatureView(
    name="driver_daily_features",
    entities=["driver"],
    ttl=timedelta(seconds=8640000000),
    schema=[Field(name="daily_miles_driven", dtype=Float32)],
    online=True,
    source=driver_stats_push_source,
    tags={"production": "True"},
    owner="test2@gmail.com",
)
```

Using a `PushSource` alleviates this. By depending on the `driver_daily_features` feature view, the data scientist knows that at serving time the model will have access to values that are as fresh as possible. Engineers just need to make sure that any registered `PushSource`s have feature values regularly pushed, so the features stay consistently available.
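Once this feature view is registered, pushing a freshly computed value into the online store via the Python SDK might look like the following sketch. It assumes you run it from the feature repo directory after `feast apply`, with Redis up; the sample row values are made up:

```python
from datetime import datetime

import pandas as pd

# A single freshly aggregated row, e.g. emitted by a stream processing job.
# Columns mirror the driver stats schema described above.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime.utcnow()],
        "created": [datetime.utcnow()],
        "daily_miles_driven": [250.0],
    }
)

RUN_PUSH = False  # Set True when a Feast repo + Redis are actually available.
if RUN_PUSH:
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # Reads feature_store.yaml here.
    # Push to the PushSource by name; Feast writes to the online store for
    # every feature view that consumes this push source.
    store.push("driver_stats_push_source", event_df)
```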
In the future, Feast will support the concept of a StreamFeatureView as well, which simplifies the life for the engineer further. This will directly ingest from a streaming source (e.g. Kafka) and apply transformations so an engineer doesn't need to scan for PushSources and push data into Feast.
We'll switch gears into a Jupyter notebook. This will guide you through:
- Registering a `FeatureView` that has a single schema across both a batch source (`FileSource`) with aggregate features and a stream source (`PushSource`).
  - Note: Feast will, in the future, also support directly authoring a `StreamFeatureView` that contains stream transformations / aggregations (e.g. via Spark, Flink, or Bytewax).
- Materializing feature view values from batch sources to the online store (e.g. Redis).
- Ingesting feature view values from streaming sources (e.g. window aggregate features from Spark + Kafka).
- Retrieving features at low latency from Redis through Feast.
- Working with a Feast push server + feature server to ingest and retrieve features through HTTP endpoints (instead of needing `feature_store.yaml` and `FeatureStore` instances).
Run the Jupyter notebook (feature_repo/workshop.ipynb).
To ensure fresh features, you'll want to schedule materialization jobs regularly. This can be as simple as a cron job that calls `feast materialize-incremental`.
Users may also be interested in integrating with Airflow, in which case you can build a custom Airflow image with the Feast SDK installed, and then use a `BashOperator` (with `feast materialize-incremental`) or `PythonOperator` (with `store.materialize_incremental(datetime.datetime.now())`):
```python
import datetime

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
    repo_config = RepoConfig(
        registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
        project="feast_demo_aws",
        provider="aws",
        offline_store="file",
        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
    )
    store = FeatureStore(config=repo_config)
    store.materialize_incremental(datetime.datetime.now())

# Use PythonOperator
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)

# Use BashOperator
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)
```

See also the FAQ: How do I speed up or scale up materialization?
The above notebook introduces a way to curl an HTTP endpoint to push or retrieve features from Redis.
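As a sketch of what those HTTP requests contain, the snippet below builds the JSON bodies for the push server's `/push` endpoint and the feature server's `/get-online-features` endpoint. The exact payload shapes should be confirmed against the notebook and the Feast feature server docs; the values and the commented-out `requests` calls are illustrative:

```python
import json

# Payload for the push server's /push endpoint (port 6567):
# a push source name plus column-oriented feature rows.
push_payload = {
    "push_source_name": "driver_stats_push_source",
    "df": {
        "driver_id": [1001],
        "event_timestamp": ["2022-05-13 10:59:42"],
        "created": ["2022-05-13 10:59:42"],
        "daily_miles_driven": [250.0],
    },
}

# Payload for the feature server's /get-online-features endpoint (port 6566):
# fully qualified feature references plus entity keys to look up.
get_payload = {
    "features": ["driver_daily_features:daily_miles_driven"],
    "entities": {"driver_id": [1001]},
}

print(json.dumps(push_payload))
print(json.dumps(get_payload))

# With the servers running, you could send these with e.g.:
#   requests.post("http://localhost:6567/push", data=json.dumps(push_payload))
#   requests.post("http://localhost:6566/get-online-features", data=json.dumps(get_payload))
```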
The servers by default cache the registry (expiring and reloading every 10 minutes). If you want to customize that time period, you can do so in feature_store.yaml.
Let's look at the feature_store.yaml used in this module (which configures the registry differently than in the previous module):
```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  connection_string: localhost:6379
offline_store:
  type: file
```

The registry config maps to constructor arguments for the `RegistryConfig` Pydantic model (reference).

- In the `feature_store.yaml` above, note that there is a `cache_ttl_seconds` of 5. This ensures that every five seconds, the feature server and push server will expire their registry cache. On the following request, they will refresh the registry by pulling from the registry path.
- Feast adds a convenience wrapper, so if you specify just `registry: [path]`, Feast will map that to `RegistryConfig(path=[your path])`.
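To make the convenience wrapper concrete, here is a sketch of the two equivalent forms in `feature_store.yaml` (the expanded form is what you'd use to also customize the cache TTL):

```yaml
# Shorthand: a bare path...
registry: data/local_registry.db

# ...is equivalent to the expanded form, which also lets you set the cache TTL:
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
```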
By the end of this module, you will have learned how to build streaming features that power real-time models with Feast. Feast abstracts away the need to think about data modeling in the online store and helps you:
- maintain fresh features in the online store by
  - ingesting batch features into the online store (via `feast materialize` or `feast materialize-incremental`)
  - ingesting streaming features into the online store (e.g. through `feature_store.push` or a push server endpoint (`/push`))
- serve features (e.g. through `feature_store.get_online_features` or through feature servers)
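For the serving side, a minimal sketch of an online lookup via the Python SDK is shown below. It assumes a feature repo with the `driver_daily_features` view applied and Redis running, so the actual lookup is gated behind a flag:

```python
# Fully qualified feature references: "<feature view name>:<feature name>".
feature_refs = ["driver_daily_features:daily_miles_driven"]

RUN_LOOKUP = False  # Set True when the feature repo + Redis are actually available.
if RUN_LOOKUP:
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # Reads feature_store.yaml here.
    features = store.get_online_features(
        features=feature_refs,
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()
    print(features)
```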
Can feature / push servers refresh their registry in response to an event? e.g. after a PR merges and feast apply is run?
Unfortunately, the servers don't currently support this. Feel free to contribute a PR to enable it! The tricky part is that Feast would need to keep track of these servers in the registry (or in some other way), which is not how Feast is currently designed.
Materialization in Feast by default pulls the latest feature values for each unique entity locally and writes in batches to the online store.
- Feast users can materialize multiple feature views by using the CLI: `feast materialize-incremental [FEATURE_VIEW_NAME]`
  - Caveat: By default, Feast's registry store is a single protobuf written to a file. This means there's a chance that metadata around materialization intervals gets lost if the registry changes during materialization.
  - The community is ideating on how to improve this. See RFC-035: Scalable Materialization.
- Users often also implement their own custom providers. The provider interface has a `materialize_single_feature_view` method, which users are free to implement differently (e.g. materializing with Spark or Dataflow jobs).
In general, the community is actively investigating ways to speed up materialization. Contributions are welcome!

