Data sources define where Feast reads feature data from. They specify the location and schema of raw feature data, which can reside in various storage systems (data warehouses, object storage, streaming platforms). Data sources are the foundation for defining Feature Views and enable Feast to perform point-in-time correct joins for training data generation and to materialize features to the online store.
For information about how data sources are used in offline stores for historical retrieval, see Offline Stores. For information about how Feature Views use data sources, see Feature Views and Services.
Sources: sdk/python/feast/data_source.py1-100 sdk/python/feast/feature_store.py49-55 sdk/python/feast/repo_operations.py136-220
The DataSource class is the abstract base class for all data sources in Feast. It defines the common interface that all data source implementations must follow.
| Attribute | Type | Purpose |
|---|---|---|
| `name` | `str` | Unique identifier for the data source |
| `timestamp_field` | `str` | Column name containing event timestamps |
| `created_timestamp_column` | `Optional[str]` | Column name containing creation timestamps (for deduplication) |
| `field_mapping` | `Dict[str, str]` | Maps physical column names to feature names |
| `tags` | `Dict[str, str]` | User-defined metadata tags |
The timestamp_field is required for all batch data sources and is used for point-in-time correct joins. The created_timestamp_column is optional and provides a tiebreaker when multiple records have the same event timestamp.
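To make these attributes concrete, here is a minimal sketch using `FileSource` (one of the concrete `DataSource` subclasses); the path and column names are illustrative, not taken from the Feast codebase:

```python
from feast import FileSource

# Minimal sketch of a batch source; path and column names are illustrative.
driver_stats_source = FileSource(
    name="driver_stats_source",                     # unique identifier in the registry
    path="data/driver_stats.parquet",               # illustrative local path
    timestamp_field="event_timestamp",              # required for point-in-time joins
    created_timestamp_column="created",             # tiebreaker for duplicate event timestamps
    field_mapping={"avg_rating": "driver_rating"},  # physical column -> feature name
    tags={"team": "driver_performance"},            # user-defined metadata
)
```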
Data Source Registration Flow:
Sources: sdk/python/feast/repo_operations.py114-220 sdk/python/feast/feature_store.py944-1075
Batch data sources represent static datasets used for training data generation and batch materialization. They support point-in-time correct joins through the timestamp fields.
BigQuerySource reads data from Google BigQuery tables or queries.
Configuration:
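The configuration snippet itself is not reproduced here; as a hedged sketch, a `BigQuerySource` can point at either a table or a query (names below are illustrative):

```python
from feast import BigQuerySource

# Backed by a table reference in project.dataset.table form.
customer_profiles = BigQuerySource(
    name="bq_customer_profiles",
    table="my_project.my_dataset.customer_profiles",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Alternatively, backed by a SQL query instead of a table.
customer_profiles_query = BigQuerySource(
    name="bq_customer_profiles_query",
    query="SELECT * FROM my_project.my_dataset.customer_profiles",
    timestamp_field="event_timestamp",
)
```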
Key Implementation Details:
- Accepts either a full table reference (`project.dataset.table`) or a SQL query

Example Query Generation (simplified from sdk/python/feast/infra/offline_stores/bigquery.py164-176):
```sql
SELECT {field_string}
FROM (
    SELECT {field_string},
           ROW_NUMBER() OVER(PARTITION BY {join_keys} ORDER BY {timestamps} DESC) AS _feast_row
    FROM {table_or_query}
    WHERE {timestamp_field} BETWEEN TIMESTAMP('{start}') AND TIMESTAMP('{end}')
)
WHERE _feast_row = 1
```
This query pattern retrieves the latest feature values within a time range, using ROW_NUMBER() for deduplication.
Sources: sdk/python/feast/infra/offline_stores/bigquery.py85-456 sdk/python/tests/integration/feature_repos/universal/data_sources/bigquery.py1-100
SnowflakeSource reads data from Snowflake tables or queries.
Configuration:
- `database`: Snowflake database name
- `schema`: Snowflake schema name
- `table` or `query`: Table reference or SQL query

Key Features:
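As a hedged sketch of these configuration options (parameter names follow the Feast SDK's `SnowflakeSource` constructor; verify against your version):

```python
from feast import SnowflakeSource

# Hedged sketch: a SnowflakeSource backed by a table.
driver_stats = SnowflakeSource(
    name="snowflake_driver_stats",
    database="FEAST",
    schema="PUBLIC",
    table="DRIVER_STATS",
    timestamp_field="event_timestamp",
)
```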
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/snowflake.py1-100
RedshiftSource reads data from AWS Redshift clusters.
Configuration Options:
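The original configuration listing is not reproduced here; as a hedged sketch (parameter names follow the Feast SDK's `RedshiftSource` constructor):

```python
from feast import RedshiftSource

# Hedged sketch: a RedshiftSource backed by a table in a given schema.
driver_stats = RedshiftSource(
    name="s3_driver_stats",
    table="driver_stats",
    schema="spectrum",                  # Redshift schema containing the table
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
```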
Query Pattern (sdk/python/feast/infra/offline_stores/redshift.py133-148):
- Uses the `ROW_NUMBER()` window function for deduplication

Sources: sdk/python/feast/infra/offline_stores/redshift.py47-93 sdk/python/tests/integration/feature_repos/universal/data_sources/redshift.py1-100
FileSource reads data from file systems, supporting various formats.
Supported Formats:
| Format | Class | Use Case |
|---|---|---|
| Parquet | ParquetFormat | Default format, efficient columnar storage |
| Delta | DeltaFormat | Delta Lake tables with versioning |
| CSV | (via Pandas) | Legacy data, less efficient |
Storage Locations:
- Local file system paths
- Amazon S3 (`s3://bucket/path`)
- Google Cloud Storage (`gs://bucket/path`)
- Azure Blob Storage (`wasbs://...`)

Offline Store Variants:
- Dask-based file offline store (sdk/python/feast/infra/offline_stores/dask.py)
File Data Source Configuration (sdk/python/tests/integration/feature_repos/universal/data_sources/file.py56-79):
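The referenced test configuration is not reproduced here; a hedged sketch of a Parquet-backed `FileSource` on S3 (bucket and column names are illustrative):

```python
from feast import FileSource
from feast.data_format import ParquetFormat

# Hedged sketch: Parquet data stored in S3.
driver_stats_s3 = FileSource(
    name="driver_stats_s3",
    path="s3://my-bucket/driver_stats.parquet",
    file_format=ParquetFormat(),
    timestamp_field="event_timestamp",
)
```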
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/file.py45-95 sdk/python/feast/infra/offline_stores/dask.py1-50
Stream data sources enable real-time feature ingestion from streaming platforms. They are used with StreamFeatureView to define features that require low-latency updates.
PushSource allows applications to push features directly to Feast via HTTP/gRPC endpoints.
Characteristics:
- Requires a `batch_source` for historical data
- Supports `ONLINE` (push to online store only) or `OFFLINE` (push to offline store) modes

Data Source Registration with PushSource (sdk/python/feast/repo_operations.py147-156):
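The referenced registration snippet is not reproduced here; as a hedged sketch, defining a `PushSource` and pushing a row to it looks roughly like this (`driver_stats_source` is the illustrative batch source defined earlier):

```python
import pandas as pd
from feast import FeatureStore, PushSource
from feast.data_source import PushMode

# A PushSource wraps a batch source that supplies historical data.
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_source,
)

# Push fresh feature values to the online store only.
store = FeatureStore(repo_path=".")
store.push(
    "driver_stats_push",
    pd.DataFrame.from_records([{
        "driver_id": 1001,
        "event_timestamp": pd.Timestamp.now(tz="UTC"),
        "conv_rate": 0.85,
    }]),
    to=PushMode.ONLINE,
)
```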
Sources: sdk/python/feast/data_source.py50-54 sdk/python/feast/feature_store.py49-55
KafkaSource consumes feature data from Apache Kafka topics.
Configuration:
- `kafka_bootstrap_servers`: Kafka broker addresses
- `topic`: Kafka topic name
- `message_format`: Serialization format (Avro, JSON, Protobuf)
- `batch_source`: Companion batch source for historical data

Sources: sdk/python/feast/data_source.py51
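As a hedged sketch of the configuration above (the JSON schema string and broker addresses are illustrative):

```python
from feast import KafkaSource
from feast.data_format import JsonFormat

# Hedged sketch: a KafkaSource consuming JSON messages from a topic.
driver_stats_stream = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="broker1:9092,broker2:9092",
    topic="driver_stats",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_source,  # companion batch source for historical data
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double"
    ),
)
```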
KinesisSource consumes feature data from AWS Kinesis streams.
Configuration:
- `stream_name`: Kinesis stream name
- `region`: AWS region
- `batch_source`: Companion batch source for historical data

Sources: sdk/python/feast/data_source.py52
Timestamp fields are critical for point-in-time correctness in Feast.
Timestamp Field Semantics:
Inference Logic (sdk/python/feast/inference.py1-100):
- If `timestamp_field` is not specified, Feast attempts to infer it from the data source schema
- Inference looks for common timestamp column names (e.g., `event_timestamp`, `ts`, `timestamp`)

Sources: sdk/python/feast/inference.py70-73 sdk/python/feast/infra/offline_stores/offline_utils.py28-44
Field mapping allows renaming columns from the physical data source to feature names.
Use Cases:
Example:
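The original example is not reproduced here; a hedged sketch of renaming a physical column to a feature name:

```python
from feast import FileSource

# The physical column "trip_dist_km" is exposed to Feast
# under the feature name "trip_distance".
trips_source = FileSource(
    name="trips_source",
    path="data/trips.parquet",
    timestamp_field="event_timestamp",
    field_mapping={"trip_dist_km": "trip_distance"},
)
```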
Implementation (sdk/python/feast/utils.py200-250):
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/file.py77-78
Key Files Involved:
Sources: sdk/python/feast/repo_operations.py136-156 sdk/python/feast/feature_store.py851-859
Data sources are validated to ensure compatibility with the configured provider and offline store.
Validation Steps:
Provider-Specific Validation:
Sources: sdk/python/feast/repo_operations.py236-240 sdk/python/feast/feature_store.py346-351
Data sources are consumed by Feature Views to define the source of feature data.
Integration Pattern:
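The integration diagram is not reproduced here; as a hedged sketch, a FeatureView consumes a data source through its `source` argument (`driver_stats_source` is the illustrative batch source from earlier):

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

# A FeatureView reading feature data from a batch source.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_stats_source,
)
```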
FeatureView Validation (sdk/python/feast/feature_store.py690-701):
- Ensures the `batch_source` timestamp fields match FeatureView expectations

Sources: sdk/python/feast/feature_store.py665-727 sdk/python/feast/feature_view.py1-100
Offline stores provide different retrieval methods for data sources:
Retrieves the most recent feature values within a time range for materialization.
Method Signature:
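The signature is not reproduced here; assuming this section describes the offline store's `pull_latest_from_table_or_query` method, its shape in the Feast SDK is roughly:

```python
from datetime import datetime
from typing import List, Optional

from feast.data_source import DataSource
from feast.infra.offline_stores.offline_store import RetrievalJob
from feast.repo_config import RepoConfig

# Rough sketch of the OfflineStore method this section appears to describe.
def pull_latest_from_table_or_query(
    config: RepoConfig,
    data_source: DataSource,
    join_key_columns: List[str],
    feature_name_columns: List[str],
    timestamp_field: str,
    created_timestamp_column: Optional[str],
    start_date: datetime,
    end_date: datetime,
) -> RetrievalJob:
    ...
```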
Query Pattern (common across BigQuery, Redshift, Snowflake): the `ROW_NUMBER()` deduplication query shown in the BigQuery section above.
Sources: sdk/python/feast/infra/offline_stores/bigquery.py127-183 sdk/python/feast/infra/offline_stores/redshift.py97-148
Retrieves all feature values within a time range (no deduplication).
Use Cases:
Implementation Difference:
- Omits the `ROW_NUMBER()` window function used for deduplication

Sources: sdk/python/feast/infra/offline_stores/bigquery.py186-232
Performs point-in-time correct joins between entity dataframe and feature data.
Complex Join Logic:
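The join construction itself lives in the offline store implementations; as a hedged sketch of the user-facing call that triggers it:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe: the rows to enrich, each with an event timestamp.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 10:59:42", "2021-04-12 08:12:10"]
    ),
})

# Point-in-time correct join against registered feature views.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
).to_df()
```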
Sources: sdk/python/feast/infra/offline_stores/bigquery.py235-340 sdk/python/feast/infra/offline_stores/offline_utils.py1-100
Feast provides a universal testing framework for data sources that works across different storage backends.
Test Data Flow (sdk/python/tests/conftest.py232-280):
- A `DataSourceCreator` is selected based on the offline store type
- The creator produces `DataSource` instances pointing to test data

Universal Test Datasets (sdk/python/tests/integration/feature_repos/repo_configuration.py258-299):
- `customer_df`: Customer profile features
- `driver_df`: Driver statistics features
- `location_df`: Location-based features
- `orders_df`: Order transaction features
- `global_df`: Global aggregation features

Sources: sdk/python/tests/integration/feature_repos/universal/data_source_creator.py1-50 sdk/python/tests/integration/feature_repos/repo_configuration.py315-362
Data sources are stored in the registry for metadata management.
Registry Methods:
| Method | Purpose |
|---|---|
| `list_data_sources()` | Retrieve all data sources for a project |
| `get_data_source()` | Retrieve a specific data source by name |
| `apply_data_source()` | Register or update a data source |
| `delete_data_source()` | Remove a data source from registry |
List Data Sources (sdk/python/feast/feature_store.py462-477):
Get Data Source (sdk/python/feast/feature_store.py598-611):
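The referenced snippets are not reproduced here; a hedged sketch of the corresponding `FeatureStore` calls:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# List every data source registered for the project.
for source in store.list_data_sources():
    print(source.name)

# Fetch a single data source by name; raises an error if it does not exist.
driver_source = store.get_data_source("driver_stats_source")
```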
Sources: sdk/python/feast/feature_store.py462-477 sdk/python/feast/feature_store.py598-611 sdk/python/feast/errors.py155-160
- Use descriptive, unique names (e.g., `bq_customer_profiles`, `s3_driver_stats`)

Name Validation (sdk/python/feast/errors.py86-90):
- Set `created_timestamp_column` when source data may have duplicates at the same event time

For BigQuery:
- Set `billing_project_id` separately from `project_id` for cost management
- Configure `gcs_staging_location` for large result sets

For Redshift:
- Configure `s3_staging_location` for efficient data transfer
- Use an `iam_role` with minimal required permissions

For File Sources:
Sources: sdk/python/feast/infra/offline_stores/bigquery.py85-123 sdk/python/feast/infra/offline_stores/redshift.py47-92