This document covers how Feast integrates with various data sources and storage systems to support machine learning workflows. It explains the architecture and abstractions that enable Feast to read from diverse batch and streaming data sources, store features in offline stores for training, and serve features from online stores for real-time inference.
This page provides an overview of the data integration layer. For detailed information on specific topics, see the dedicated pages referenced throughout: Data Sources, Offline Stores, Online Stores, and Materialization and Compute Engines.
Feast's data integration layer follows a pluggable architecture with clear abstractions that separate concerns between data sources, storage backends, and compute engines. The Provider abstraction coordinates these components, with PassthroughProvider being the default implementation that delegates operations to the underlying stores.
Sources: sdk/python/feast/infra/provider.py41-46 sdk/python/feast/infra/passthrough_provider.py58-130 sdk/python/feast/infra/offline_stores/offline_store.py1-49 sdk/python/feast/infra/online_stores/online_store.py1-26 sdk/python/feast/repo_config.py39-107
The following table shows how configuration types in feature_store.yaml map to concrete implementations:
| Configuration Type | Example Value | Implementation Class | Module Path |
|---|---|---|---|
| offline_store.type | "bigquery" | BigQueryOfflineStore | feast.infra.offline_stores.bigquery |
| offline_store.type | "snowflake.offline" | SnowflakeOfflineStore | feast.infra.offline_stores.snowflake |
| offline_store.type | "redshift" | RedshiftOfflineStore | feast.infra.offline_stores.redshift |
| offline_store.type | "dask" or "file" | DaskOfflineStore | feast.infra.offline_stores.dask |
| online_store.type | "redis" | RedisOnlineStore | feast.infra.online_stores.redis |
| online_store.type | "dynamodb" | DynamoDBOnlineStore | feast.infra.online_stores.dynamodb |
| online_store.type | "sqlite" | SqliteOnlineStore | feast.infra.online_stores.sqlite |
| online_store.type | "postgres" | PostgreSQLOnlineStore | feast.infra.online_stores.postgres_online_store.postgres |
| batch_engine.type | "local" | LocalComputeEngine | feast.infra.compute_engines.local.compute |
| batch_engine.type | "snowflake.engine" | SnowflakeComputeEngine | feast.infra.compute_engines.snowflake.snowflake_engine |
Sources: sdk/python/feast/repo_config.py39-107
Data sources represent the raw data locations where feature values originate. Feast supports both batch sources (tables/files) and streaming sources (Kafka/Kinesis). Each data source type implements the DataSource abstraction and specifies metadata like schema, timestamp columns, and connection details.
Sources: sdk/python/feast/data_source.py1-100 sdk/python/feast/feature_store.py49-55
Data sources are defined in feature views and specify where to retrieve feature values. They include essential metadata such as:

- The event timestamp column used for point-in-time joins
- The schema of the underlying table, query, or topic
- Connection details for the source system
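For illustration, a batch source declaration might look like the following (a minimal sketch; the table and column names are hypothetical):

```python
from feast import BigQuerySource

# Hypothetical BigQuery table holding raw driver statistics.
driver_stats_source = BigQuerySource(
    name="driver_stats_source",
    table="my_project.my_dataset.driver_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_at",
)
```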
For detailed information about configuring specific data sources, see Data Sources.
Offline stores are responsible for reading historical feature values from batch data sources. They implement point-in-time correct joins to generate training datasets and support materialization by pulling the latest feature values.
The OfflineStore abstract base class defines three primary operations:
- get_historical_features() - Performs point-in-time joins between entity dataframes and feature views to generate training data
- pull_latest_from_table_or_query() - Retrieves the most recent feature values for materialization
- pull_all_from_table_or_query() - Retrieves all feature values within a time range

All offline store queries return a RetrievalJob object that represents a lazy query execution:
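A minimal sketch of this pattern from the caller's side (the entity dataframe and feature names are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with the timestamps at which features should be joined.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)

# Returns a RetrievalJob; no query has run yet.
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_daily_trips"],
)

# The query executes only when a concrete result is requested.
training_df = job.to_df()
```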
Sources: sdk/python/feast/infra/offline_stores/offline_store.py73-233 sdk/python/feast/infra/offline_stores/bigquery.py458-610 sdk/python/feast/infra/offline_stores/redshift.py219-395
The RetrievalJob pattern allows:
- Lazy execution (the underlying query only runs when .to_df() or .to_arrow() is called)

Offline stores ensure point-in-time correctness by joining features on entity keys and timestamps, so that only feature values known at prediction time are used in training data. This prevents data leakage.
The query logic uses windowing functions to select the most recent feature value before each entity's event timestamp:
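A simplified sketch of this pattern (not the exact query Feast generates; table and column names are illustrative):

```sql
-- For each entity row, keep only the newest feature value observed
-- at or before that row's event timestamp.
SELECT entity_id, event_timestamp, feature_value
FROM (
    SELECT
        e.entity_id,
        e.event_timestamp,
        f.feature_value,
        ROW_NUMBER() OVER (
            PARTITION BY e.entity_id, e.event_timestamp
            ORDER BY f.event_timestamp DESC
        ) AS row_num
    FROM entity_df AS e
    JOIN feature_table AS f
      ON f.entity_id = e.entity_id
     AND f.event_timestamp <= e.event_timestamp
) ranked
WHERE row_num = 1;
```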
Sources: sdk/python/feast/infra/offline_stores/offline_utils.py1-400 sdk/python/feast/infra/offline_stores/bigquery.py164-175
For detailed information about specific offline store implementations and their configurations, see Offline Stores.
Online stores provide low-latency access to precomputed feature values for real-time inference. They store the latest feature values indexed by entity keys and support both individual and batch reads.
The OnlineStore abstract base class defines key operations:
- online_write_batch() - Writes feature values to the online store
- online_read() - Reads feature values by entity key
- update() - Creates/updates tables for feature views
- teardown() - Removes infrastructure

Entity keys are serialized into a compact binary format for efficient storage and retrieval:
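A minimal sketch of this serialization (assuming the serialize_entity_key helper in feast.infra.key_encoding_utils; the join key and value are hypothetical):

```python
from feast.infra.key_encoding_utils import serialize_entity_key
from feast.protos.feast.types.EntityKey_pb2 import EntityKey
from feast.protos.feast.types.Value_pb2 import Value

# Hypothetical entity key: a single "driver_id" join key with an int64 value.
entity_key = EntityKey(
    join_keys=["driver_id"],
    entity_values=[Value(int64_val=1001)],
)

# Serialize to the compact binary format used as the online-store key.
key_bytes = serialize_entity_key(entity_key, entity_key_serialization_version=3)
```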
Sources: sdk/python/feast/infra/key_encoding_utils.py1-100 sdk/python/feast/repo_config.py298-305
The serialization version is controlled by entity_key_serialization_version in the repo config. Version 3 (current default) includes entity key value lengths to enable deserialization.
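For example, the version can be pinned in feature_store.yaml (illustrative snippet):

```yaml
entity_key_serialization_version: 3
```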
Online stores support two primary access patterns:
Batch Writes (Materialization):
- Feature values are converted to the ValueProto format before being written

Batch Reads (Inference):

- Feature values are returned as ValueProto objects
- Missing entities return None or default values

Sources: sdk/python/feast/infra/online_stores/online_store.py1-250 sdk/python/feast/infra/online_stores/sqlite.py1-300
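A sketch of the inference-side read path (the feature and entity names are hypothetical):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Batch read: one value per feature per entity row; entities missing
# from the online store come back as None.
response = store.get_online_features(
    features=["driver_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
)
feature_dict = response.to_dict()
```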
For detailed information about specific online store implementations and their configurations, see Online Stores.
Materialization is the process of moving feature values from offline stores to online stores, making them available for low-latency serving. This process is coordinated by the Provider and executed by ComputeEngine implementations.
Sources: sdk/python/feast/feature_store.py1399-1550 sdk/python/feast/infra/passthrough_provider.py313-440
The materialization process follows these steps:
1. The Provider calls the offline store's pull_latest_from_table_or_query() or pull_all_from_table_or_query() method to retrieve feature values
2. The retrieved values are converted to the ValueProto protobuf format
3. The converted values are written to the online store via online_write_batch()

The materialize_incremental() method materializes only new data since the last materialization:
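For example (a minimal sketch):

```python
from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialize each feature view from its last materialization
# timestamp up to the given end date.
store.materialize_incremental(end_date=datetime.utcnow())
```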
Sources: sdk/python/feast/feature_store.py1552-1633
This uses the last materialization timestamp stored in the registry to determine the start time for the next materialization window.
Different compute engines can be configured to handle materialization at scale:
The compute engine is specified in feature_store.yaml:
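For example (an illustrative snippet; engine-specific options vary):

```yaml
batch_engine:
  type: local    # or, e.g., "snowflake.engine" for Snowflake-backed materialization
```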
Sources: sdk/python/feast/repo_config.py46-53 sdk/python/feast/infra/passthrough_provider.py86-129
For detailed information about compute engines and advanced materialization patterns, see Materialization and Compute Engines.
Data integration is configured through feature_store.yaml and the RepoConfig class. The configuration specifies which offline store, online store, and batch engine to use, along with their specific settings.
Sources: sdk/python/feast/repo_config.py253-469
A typical production configuration using BigQuery offline and Redis online:
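An illustrative sketch (the project, bucket, and dataset names are hypothetical; option names follow the common BigQuery and Redis store configurations):

```yaml
project: my_project
registry: gs://my-bucket/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_features
online_store:
  type: redis
  connection_string: redis-host:6379
```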
The configuration is loaded through several paths:
1. load_repo_config() reads and parses feature_store.yaml
2. Each store configuration is resolved from its type field
3. Type strings are mapped to implementation classes via lookup tables such as OFFLINE_STORE_CLASS_FOR_TYPE

Sources: sdk/python/feast/repo_config.py39-122 sdk/python/feast/repo_config.py318-469
The FeatureStore class in sdk/python/feast/feature_store.py105-2999 is the main entry point for all data integration operations:
- get_historical_features() - Retrieves training data from offline stores
- get_online_features() - Retrieves inference features from online stores
- materialize() / materialize_incremental() - Moves data to online stores
- write_to_offline_store() - Writes data to offline stores
- push() - Pushes streaming features to online stores
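For instance, streaming rows can be pushed through this same entry point (a sketch; the push source name and dataframe are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Hypothetical event rows arriving from a stream consumer.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [pd.Timestamp.utcnow()],
        "avg_daily_trips": [15],
    }
)

# Push to the online store via the feature view's push source.
store.push("driver_stats_push_source", event_df)
```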
The Provider abstract class in sdk/python/feast/infra/provider.py49-474 defines the interface for coordinating storage operations:

- update_infra() - Creates/updates infrastructure
- materialize_single_feature_view() - Executes materialization
- get_historical_features() - Delegates to the offline store
- online_write_batch() / online_read() - Delegates to the online store

The PassthroughProvider in sdk/python/feast/infra/passthrough_provider.py58-554 is the default implementation that delegates to the configured stores.
OfflineStore sdk/python/feast/infra/offline_stores/offline_store.py234-417:
- get_historical_features() - Point-in-time joins
- pull_latest_from_table_or_query() - Latest values for materialization
- pull_all_from_table_or_query() - All values in a time range

OnlineStore sdk/python/feast/infra/online_stores/online_store.py28-250:
- online_write_batch() - Batch write operation
- online_read() - Batch read operation
- update() - Infrastructure management
- get_online_features() - High-level read with transformations

Data sources inherit from DataSource in sdk/python/feast/data_source.py1-400 and define:
- get_table_query_string() - Returns a SQL query or table reference

Sources: sdk/python/feast/feature_store.py105-2999 sdk/python/feast/infra/provider.py49-474 sdk/python/feast/infra/passthrough_provider.py58-554 sdk/python/feast/infra/offline_stores/offline_store.py234-417 sdk/python/feast/infra/online_stores/online_store.py28-250 sdk/python/feast/data_source.py1-400