This document covers how Feast integrates with various data sources and storage systems to support machine learning workflows. It explains the architecture and abstractions that enable Feast to read from diverse batch and streaming data sources, store features in offline stores for training, and serve features from online stores for real-time inference.
This page provides an overview of the data integration layer. For detailed information on specific topics, see the dedicated pages referenced throughout: Data Sources, Offline Stores, Online Stores, and Materialization and Compute Engines.
Feast's data integration layer follows a pluggable architecture with clear abstractions that separate concerns between data sources, storage backends, and compute engines. The Provider abstraction coordinates these components, with PassthroughProvider being the default implementation that delegates operations to the underlying stores.
Sources: sdk/python/feast/infra/provider.py41-46 sdk/python/feast/infra/passthrough_provider.py58-130 sdk/python/feast/infra/offline_stores/offline_store.py1-49 sdk/python/feast/infra/online_stores/online_store.py1-26 sdk/python/feast/repo_config.py39-107
The following table shows how configuration types in feature_store.yaml map to concrete implementations:
| Configuration Type | Example Value | Implementation Class | Module Path |
|---|---|---|---|
| offline_store.type | "bigquery" | BigQueryOfflineStore | feast.infra.offline_stores.bigquery |
| offline_store.type | "snowflake.offline" | SnowflakeOfflineStore | feast.infra.offline_stores.snowflake |
| offline_store.type | "redshift" | RedshiftOfflineStore | feast.infra.offline_stores.redshift |
| offline_store.type | "dask" or "file" | DaskOfflineStore | feast.infra.offline_stores.dask |
| online_store.type | "redis" | RedisOnlineStore | feast.infra.online_stores.redis |
| online_store.type | "dynamodb" | DynamoDBOnlineStore | feast.infra.online_stores.dynamodb |
| online_store.type | "sqlite" | SqliteOnlineStore | feast.infra.online_stores.sqlite |
| online_store.type | "postgres" | PostgreSQLOnlineStore | feast.infra.online_stores.postgres_online_store.postgres |
| batch_engine.type | "local" | LocalComputeEngine | feast.infra.compute_engines.local.compute |
| batch_engine.type | "snowflake.engine" | SnowflakeComputeEngine | feast.infra.compute_engines.snowflake.snowflake_engine |
Sources: sdk/python/feast/repo_config.py39-107
Data sources represent the raw data locations where feature values originate. Feast supports both batch sources (tables/files) and streaming sources (Kafka/Kinesis). Each data source type implements the DataSource abstraction and specifies metadata like schema, timestamp columns, and connection details.
Sources: sdk/python/feast/data_source.py1-100 sdk/python/feast/feature_store.py49-55
Data sources are defined in feature views and specify where to retrieve feature values. They include essential metadata such as:

- The event timestamp column used for point-in-time joins
- The schema of the underlying table, query, or topic
- Connection details for the source system
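For illustration, a batch source declaration might look like the following (a minimal sketch; the table and column names are hypothetical):

```python
from feast import BigQuerySource

# Hypothetical BigQuery table holding raw driver statistics.
driver_stats_source = BigQuerySource(
    name="driver_stats_source",
    table="my_project.my_dataset.driver_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_at",
)
```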
For detailed information about configuring specific data sources, see Data Sources.
Offline stores are responsible for reading historical feature values from batch data sources. They implement point-in-time correct joins to generate training datasets and support materialization by pulling the latest feature values.
The OfflineStore abstract base class defines three primary operations:
- get_historical_features() - Performs point-in-time joins between entity dataframes and feature views to generate training data
- pull_latest_from_table_or_query() - Retrieves the most recent feature values for materialization
- pull_all_from_table_or_query() - Retrieves all feature values within a time range

All offline store queries return a RetrievalJob object that represents a lazy query execution:
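A minimal sketch of this pattern from the caller's side (the entity dataframe and feature names are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with the timestamps at which features should be joined.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)

# Returns a RetrievalJob; no query has run yet.
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_daily_trips"],
)

# The query executes only when a concrete result is requested.
training_df = job.to_df()
```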
Sources: sdk/python/feast/infra/offline_stores/offline_store.py73-233 sdk/python/feast/infra/offline_stores/bigquery.py458-610 sdk/python/feast/infra/offline_stores/redshift.py219-395
The RetrievalJob pattern allows:
- Lazy execution (the underlying query only runs when .to_df() or .to_arrow() is called)

Offline stores ensure point-in-time correctness by joining features on entity keys and timestamps, so that only feature values known at prediction time are used in training data. This prevents data leakage.
The query logic uses windowing functions to select the most recent feature value before each entity's event timestamp:
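A simplified sketch of this pattern (not the exact query Feast generates; table and column names are illustrative):

```sql
-- For each entity row, keep only the newest feature value observed
-- at or before that row's event timestamp.
SELECT entity_id, event_timestamp, feature_value
FROM (
    SELECT
        e.entity_id,
        e.event_timestamp,
        f.feature_value,
        ROW_NUMBER() OVER (
            PARTITION BY e.entity_id, e.event_timestamp
            ORDER BY f.event_timestamp DESC
        ) AS row_num
    FROM entity_df AS e
    JOIN feature_table AS f
      ON f.entity_id = e.entity_id
     AND f.event_timestamp <= e.event_timestamp
) ranked
WHERE row_num = 1;
```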
Sources: sdk/python/feast/infra/offline_stores/offline_utils.py1-400 sdk/python/feast/infra/offline_stores/bigquery.py164-175
For detailed information about specific offline store implementations and their configurations, see Offline Stores.
Online stores provide low-latency access to precomputed feature values for real-time inference. They store the latest feature values indexed by entity keys and support both individual and batch reads.
The OnlineStore abstract base class defines key operations:
- online_write_batch() - Writes feature values to the online store
- online_read() - Reads feature values by entity key
- update() - Creates/updates tables for feature views
- teardown() - Removes infrastructure

Entity keys are serialized into a compact binary format for efficient storage and retrieval:
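A minimal sketch of this serialization (assuming the serialize_entity_key helper in feast.infra.key_encoding_utils; the join key and value are hypothetical):

```python
from feast.infra.key_encoding_utils import serialize_entity_key
from feast.protos.feast.types.EntityKey_pb2 import EntityKey
from feast.protos.feast.types.Value_pb2 import Value

# Hypothetical entity key: a single "driver_id" join key with an int64 value.
entity_key = EntityKey(
    join_keys=["driver_id"],
    entity_values=[Value(int64_val=1001)],
)

# Serialize to the compact binary format used as the online-store key.
key_bytes = serialize_entity_key(entity_key, entity_key_serialization_version=3)
```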
Sources: sdk/python/feast/infra/key_encoding_utils.py1-100 sdk/python/feast/repo_config.py298-305
The serialization version is controlled by entity_key_serialization_version in the repo config. Version 3 (current default) includes entity key value lengths to enable deserialization.
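For example, the version can be pinned in feature_store.yaml (illustrative snippet):

```yaml
entity_key_serialization_version: 3
```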
Online stores support two primary access patterns:
Batch Writes (Materialization):
- Feature values are converted to the ValueProto format before being written

Batch Reads (Inference):

- Feature values are returned as ValueProto objects
- Missing entities return None or default values

Sources: sdk/python/feast/infra/online_stores/online_store.py1-250 sdk/python/feast/infra/online_stores/sqlite.py1-300
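A sketch of the inference-side read path (the feature and entity names are hypothetical):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Batch read: one value per feature per entity row; entities missing
# from the online store come back as None.
response = store.get_online_features(
    features=["driver_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
)
feature_dict = response.to_dict()
```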
For detailed information about specific online store implementations and their configurations, see Online Stores.
Materialization is the process of moving feature values from offline stores to online stores, making them available for low-latency serving. This process is coordinated by the Provider and executed by ComputeEngine implementations.
Sources: sdk/python/feast/feature_store.py1399-1550 sdk/python/feast/infra/passthrough_provider.py313-440
The materialization process follows these steps:
1. The Provider calls the offline store's pull_latest_from_table_or_query() or pull_all_from_table_or_query() method to retrieve feature values
2. The retrieved values are converted to the ValueProto protobuf format
3. The converted values are written to the online store via online_write_batch()

The materialize_incremental() method materializes only new data since the last materialization:
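For example (a minimal sketch):

```python
from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialize each feature view from its last materialization
# timestamp up to the given end date.
store.materialize_incremental(end_date=datetime.utcnow())
```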
Sources: sdk/python/feast/feature_store.py1552-1633
This uses the last materialization timestamp stored in the registry to determine the start time for the next materialization window.
Different compute engines can be configured to handle materialization at scale:
The compute engine is specified in feature_store.yaml:
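For example (an illustrative snippet; engine-specific options vary):

```yaml
batch_engine:
  type: local    # or, e.g., "snowflake.engine" for Snowflake-backed materialization
```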
Sources: sdk/python/feast/repo_config.py46-53 sdk/python/feast/infra/passthrough_provider.py86-129
For detailed information about compute engines and advanced materialization patterns, see Materialization and Compute Engines.
Data integration is configured through feature_store.yaml and the RepoConfig class. The configuration specifies which offline store, online store, and batch engine to use, along with their specific settings.
Sources: sdk/python/feast/repo_config.py253-469
A typical production configuration using BigQuery offline and Redis online:
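An illustrative sketch (the project, bucket, and dataset names are hypothetical; option names follow the common BigQuery and Redis store configurations):

```yaml
project: my_project
registry: gs://my-bucket/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_features
online_store:
  type: redis
  connection_string: redis-host:6379
```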
The configuration is loaded through several paths:
1. load_repo_config() reads and parses feature_store.yaml
2. Each store configuration is resolved from its type field
3. Type strings are mapped to implementation classes via lookup tables such as OFFLINE_STORE_CLASS_FOR_TYPE

Sources: sdk/python/feast/repo_config.py39-122 sdk/python/feast/repo_config.py318-469
The FeatureStore class in sdk/python/feast/feature_store.py105-2999 is the main entry point for all data integration operations:
- get_historical_features() - Retrieves training data from offline stores
- get_online_features() - Retrieves inference features from online stores
- materialize() / materialize_incremental() - Moves data to online stores
- write_to_offline_store() - Writes data to offline stores
- push() - Pushes streaming features to online stores
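For instance, streaming rows can be pushed through this same entry point (a sketch; the push source name and dataframe are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Hypothetical event rows arriving from a stream consumer.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [pd.Timestamp.utcnow()],
        "avg_daily_trips": [15],
    }
)

# Push to the online store via the feature view's push source.
store.push("driver_stats_push_source", event_df)
```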
The Provider abstract class in sdk/python/feast/infra/provider.py49-474 defines the interface for coordinating storage operations:

- update_infra() - Creates/updates infrastructure
- materialize_single_feature_view() - Executes materialization
- get_historical_features() - Delegates to the offline store
- online_write_batch() / online_read() - Delegates to the online store

The PassthroughProvider in sdk/python/feast/infra/passthrough_provider.py58-554 is the default implementation that delegates to the configured stores.
OfflineStore sdk/python/feast/infra/offline_stores/offline_store.py234-417:
- get_historical_features() - Point-in-time joins
- pull_latest_from_table_or_query() - Latest values for materialization
- pull_all_from_table_or_query() - All values in a time range

OnlineStore sdk/python/feast/infra/online_stores/online_store.py28-250:
- online_write_batch() - Batch write operation
- online_read() - Batch read operation
- update() - Infrastructure management
- get_online_features() - High-level read with transformations

Data sources inherit from DataSource in sdk/python/feast/data_source.py1-400 and define:
- get_table_query_string() - Returns a SQL query or table reference

Sources: sdk/python/feast/feature_store.py105-2999 sdk/python/feast/infra/provider.py49-474 sdk/python/feast/infra/passthrough_provider.py58-554 sdk/python/feast/infra/offline_stores/offline_store.py234-417 sdk/python/feast/infra/online_stores/online_store.py28-250 sdk/python/feast/data_source.py1-400