Data sources define where Feast reads feature data from. They specify the location and schema of raw feature data, which can reside in various storage systems (data warehouses, object storage, streaming platforms). Data sources are the foundation for defining Feature Views and enable Feast to perform point-in-time correct joins for training data generation and to materialize features to the online store.
For information about how data sources are used in offline stores for historical retrieval, see Offline Stores. For information about how Feature Views use data sources, see Feature Views and Services.
Sources: sdk/python/feast/data_source.py1-100 sdk/python/feast/feature_store.py49-55 sdk/python/feast/repo_operations.py136-220
The DataSource class is the abstract base class for all data sources in Feast. It defines the common interface that all data source implementations must follow.
| Attribute | Type | Purpose |
|---|---|---|
| `name` | `str` | Unique identifier for the data source |
| `timestamp_field` | `str` | Column name containing event timestamps |
| `created_timestamp_column` | `Optional[str]` | Column name containing creation timestamps (for deduplication) |
| `field_mapping` | `Dict[str, str]` | Maps physical column names to feature names |
| `tags` | `Dict[str, str]` | User-defined metadata tags |
The timestamp_field is required for all batch data sources and is used for point-in-time correct joins. The created_timestamp_column is optional and provides a tiebreaker when multiple records have the same event timestamp.
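To make these attributes concrete, here is a minimal sketch using `FileSource` (one of the concrete `DataSource` subclasses); the path and column names are illustrative, not taken from the Feast codebase:

```python
from feast import FileSource

# Minimal sketch of a batch source; path and column names are illustrative.
driver_stats_source = FileSource(
    name="driver_stats_source",                     # unique identifier in the registry
    path="data/driver_stats.parquet",               # illustrative local path
    timestamp_field="event_timestamp",              # required for point-in-time joins
    created_timestamp_column="created",             # tiebreaker for duplicate event timestamps
    field_mapping={"avg_rating": "driver_rating"},  # physical column -> feature name
    tags={"team": "driver_performance"},            # user-defined metadata
)
```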
Data Source Registration Flow:
Sources: sdk/python/feast/repo_operations.py114-220 sdk/python/feast/feature_store.py944-1075
Batch data sources represent static datasets used for training data generation and batch materialization. They support point-in-time correct joins through the timestamp fields.
BigQuerySource reads data from Google BigQuery tables or queries.
Configuration:
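The configuration snippet itself is not reproduced here; as a hedged sketch, a `BigQuerySource` can point at either a table or a query (names below are illustrative):

```python
from feast import BigQuerySource

# Backed by a table reference in project.dataset.table form.
customer_profiles = BigQuerySource(
    name="bq_customer_profiles",
    table="my_project.my_dataset.customer_profiles",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Alternatively, backed by a SQL query instead of a table.
customer_profiles_query = BigQuerySource(
    name="bq_customer_profiles_query",
    query="SELECT * FROM my_project.my_dataset.customer_profiles",
    timestamp_field="event_timestamp",
)
```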
Key Implementation Details:
- Accepts either a full table reference (`project.dataset.table`) or a SQL query

Example Query Generation (simplified from sdk/python/feast/infra/offline_stores/bigquery.py164-176):
```sql
SELECT {field_string}
FROM (
    SELECT {field_string},
           ROW_NUMBER() OVER(PARTITION BY {join_keys} ORDER BY {timestamps} DESC) AS _feast_row
    FROM {table_or_query}
    WHERE {timestamp_field} BETWEEN TIMESTAMP('{start}') AND TIMESTAMP('{end}')
)
WHERE _feast_row = 1
```
This query pattern retrieves the latest feature values within a time range, using ROW_NUMBER() for deduplication.
Sources: sdk/python/feast/infra/offline_stores/bigquery.py85-456 sdk/python/tests/integration/feature_repos/universal/data_sources/bigquery.py1-100
SnowflakeSource reads data from Snowflake tables or queries.
Configuration:
- `database`: Snowflake database name
- `schema`: Snowflake schema name
- `table` or `query`: Table reference or SQL query

Key Features:
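As a hedged sketch of these configuration options (parameter names follow the Feast SDK's `SnowflakeSource` constructor; verify against your version):

```python
from feast import SnowflakeSource

# Hedged sketch: a SnowflakeSource backed by a table.
driver_stats = SnowflakeSource(
    name="snowflake_driver_stats",
    database="FEAST",
    schema="PUBLIC",
    table="DRIVER_STATS",
    timestamp_field="event_timestamp",
)
```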
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/snowflake.py1-100
RedshiftSource reads data from AWS Redshift clusters.
Configuration Options:
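The original configuration listing is not reproduced here; as a hedged sketch (parameter names follow the Feast SDK's `RedshiftSource` constructor):

```python
from feast import RedshiftSource

# Hedged sketch: a RedshiftSource backed by a table in a given schema.
driver_stats = RedshiftSource(
    name="s3_driver_stats",
    table="driver_stats",
    schema="spectrum",                  # Redshift schema containing the table
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
```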
Query Pattern (sdk/python/feast/infra/offline_stores/redshift.py133-148):
- Uses the `ROW_NUMBER()` window function for deduplication

Sources: sdk/python/feast/infra/offline_stores/redshift.py47-93 sdk/python/tests/integration/feature_repos/universal/data_sources/redshift.py1-100
FileSource reads data from file systems, supporting various formats.
Supported Formats:
| Format | Class | Use Case |
|---|---|---|
| Parquet | ParquetFormat | Default format, efficient columnar storage |
| Delta | DeltaFormat | Delta Lake tables with versioning |
| CSV | (via Pandas) | Legacy data, less efficient |
Storage Locations:
- Local file system paths
- Amazon S3 (`s3://bucket/path`)
- Google Cloud Storage (`gs://bucket/path`)
- Azure Blob Storage (`wasbs://...`)

Offline Store Variants:
- Dask-based file offline store (sdk/python/feast/infra/offline_stores/dask.py)
File Data Source Configuration (sdk/python/tests/integration/feature_repos/universal/data_sources/file.py56-79):
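The referenced test configuration is not reproduced here; a hedged sketch of a Parquet-backed `FileSource` on S3 (bucket and column names are illustrative):

```python
from feast import FileSource
from feast.data_format import ParquetFormat

# Hedged sketch: Parquet data stored in S3.
driver_stats_s3 = FileSource(
    name="driver_stats_s3",
    path="s3://my-bucket/driver_stats.parquet",
    file_format=ParquetFormat(),
    timestamp_field="event_timestamp",
)
```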
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/file.py45-95 sdk/python/feast/infra/offline_stores/dask.py1-50
Stream data sources enable real-time feature ingestion from streaming platforms. They are used with StreamFeatureView to define features that require low-latency updates.
PushSource allows applications to push features directly to Feast via HTTP/gRPC endpoints.
Characteristics:
- Requires a `batch_source` for historical data
- Supports `ONLINE` (push to online store only) or `OFFLINE` (push to offline store) modes

Data Source Registration with PushSource (sdk/python/feast/repo_operations.py147-156):
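The referenced registration snippet is not reproduced here; as a hedged sketch, defining a `PushSource` and pushing a row to it looks roughly like this (`driver_stats_source` is the illustrative batch source defined earlier):

```python
import pandas as pd
from feast import FeatureStore, PushSource
from feast.data_source import PushMode

# A PushSource wraps a batch source that supplies historical data.
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_source,
)

# Push fresh feature values to the online store only.
store = FeatureStore(repo_path=".")
store.push(
    "driver_stats_push",
    pd.DataFrame.from_records([{
        "driver_id": 1001,
        "event_timestamp": pd.Timestamp.now(tz="UTC"),
        "conv_rate": 0.85,
    }]),
    to=PushMode.ONLINE,
)
```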
Sources: sdk/python/feast/data_source.py50-54 sdk/python/feast/feature_store.py49-55
KafkaSource consumes feature data from Apache Kafka topics.
Configuration:
- `kafka_bootstrap_servers`: Kafka broker addresses
- `topic`: Kafka topic name
- `message_format`: Serialization format (Avro, JSON, Protobuf)
- `batch_source`: Companion batch source for historical data

Sources: sdk/python/feast/data_source.py51
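As a hedged sketch of the configuration above (the JSON schema string and broker addresses are illustrative):

```python
from feast import KafkaSource
from feast.data_format import JsonFormat

# Hedged sketch: a KafkaSource consuming JSON messages from a topic.
driver_stats_stream = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="broker1:9092,broker2:9092",
    topic="driver_stats",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_source,  # companion batch source for historical data
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double"
    ),
)
```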
KinesisSource consumes feature data from AWS Kinesis streams.
Configuration:
- `stream_name`: Kinesis stream name
- `region`: AWS region
- `batch_source`: Companion batch source for historical data

Sources: sdk/python/feast/data_source.py52
Timestamp fields are critical for point-in-time correctness in Feast.
Timestamp Field Semantics:
Inference Logic (sdk/python/feast/inference.py1-100):
- If `timestamp_field` is not specified, Feast attempts to infer it from the data source schema
- Inference looks for common timestamp column names (e.g., `event_timestamp`, `ts`, `timestamp`)

Sources: sdk/python/feast/inference.py70-73 sdk/python/feast/infra/offline_stores/offline_utils.py28-44
Field mapping allows renaming columns from the physical data source to feature names.
Use Cases:
Example:
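The original example is not reproduced here; a hedged sketch of renaming a physical column to a feature name:

```python
from feast import FileSource

# The physical column "trip_dist_km" is exposed to Feast
# under the feature name "trip_distance".
trips_source = FileSource(
    name="trips_source",
    path="data/trips.parquet",
    timestamp_field="event_timestamp",
    field_mapping={"trip_dist_km": "trip_distance"},
)
```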
Implementation (sdk/python/feast/utils.py200-250):
Sources: sdk/python/tests/integration/feature_repos/universal/data_sources/file.py77-78
Key Files Involved:
Sources: sdk/python/feast/repo_operations.py136-156 sdk/python/feast/feature_store.py851-859
Data sources are validated to ensure compatibility with the configured provider and offline store.
Validation Steps:
Provider-Specific Validation:
Sources: sdk/python/feast/repo_operations.py236-240 sdk/python/feast/feature_store.py346-351
Data sources are consumed by Feature Views to define the source of feature data.
Integration Pattern:
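The integration diagram is not reproduced here; as a hedged sketch, a FeatureView consumes a data source through its `source` argument (`driver_stats_source` is the illustrative batch source from earlier):

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

# A FeatureView reading feature data from a batch source.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_stats_source,
)
```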
FeatureView Validation (sdk/python/feast/feature_store.py690-701):
- Ensures the `batch_source` timestamp fields match FeatureView expectations

Sources: sdk/python/feast/feature_store.py665-727 sdk/python/feast/feature_view.py1-100
Offline stores provide different retrieval methods for data sources:
Retrieves the most recent feature values within a time range for materialization.
Method Signature:
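The signature is not reproduced here; assuming this section describes the offline store's `pull_latest_from_table_or_query` method, its shape in the Feast SDK is roughly:

```python
from datetime import datetime
from typing import List, Optional

from feast.data_source import DataSource
from feast.infra.offline_stores.offline_store import RetrievalJob
from feast.repo_config import RepoConfig

# Rough sketch of the OfflineStore method this section appears to describe.
def pull_latest_from_table_or_query(
    config: RepoConfig,
    data_source: DataSource,
    join_key_columns: List[str],
    feature_name_columns: List[str],
    timestamp_field: str,
    created_timestamp_column: Optional[str],
    start_date: datetime,
    end_date: datetime,
) -> RetrievalJob:
    ...
```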
Query Pattern (common across BigQuery, Redshift, Snowflake): the `ROW_NUMBER()` deduplication query shown in the BigQuery section above.
Sources: sdk/python/feast/infra/offline_stores/bigquery.py127-183 sdk/python/feast/infra/offline_stores/redshift.py97-148
Retrieves all feature values within a time range (no deduplication).
Use Cases:
Implementation Difference:
- Omits the `ROW_NUMBER()` window function used for deduplication

Sources: sdk/python/feast/infra/offline_stores/bigquery.py186-232
Performs point-in-time correct joins between entity dataframe and feature data.
Complex Join Logic:
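The join construction itself lives in the offline store implementations; as a hedged sketch of the user-facing call that triggers it:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe: the rows to enrich, each with an event timestamp.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 10:59:42", "2021-04-12 08:12:10"]
    ),
})

# Point-in-time correct join against registered feature views.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
).to_df()
```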
Sources: sdk/python/feast/infra/offline_stores/bigquery.py235-340 sdk/python/feast/infra/offline_stores/offline_utils.py1-100
Feast provides a universal testing framework for data sources that works across different storage backends.
Test Data Flow (sdk/python/tests/conftest.py232-280):
- A `DataSourceCreator` is selected based on the offline store type
- The creator produces `DataSource` instances pointing to test data

Universal Test Datasets (sdk/python/tests/integration/feature_repos/repo_configuration.py258-299):
- `customer_df`: Customer profile features
- `driver_df`: Driver statistics features
- `location_df`: Location-based features
- `orders_df`: Order transaction features
- `global_df`: Global aggregation features

Sources: sdk/python/tests/integration/feature_repos/universal/data_source_creator.py1-50 sdk/python/tests/integration/feature_repos/repo_configuration.py315-362
Data sources are stored in the registry for metadata management.
Registry Methods:
| Method | Purpose |
|---|---|
| `list_data_sources()` | Retrieve all data sources for a project |
| `get_data_source()` | Retrieve a specific data source by name |
| `apply_data_source()` | Register or update a data source |
| `delete_data_source()` | Remove a data source from registry |
List Data Sources (sdk/python/feast/feature_store.py462-477):
Get Data Source (sdk/python/feast/feature_store.py598-611):
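The referenced snippets are not reproduced here; a hedged sketch of the corresponding `FeatureStore` calls:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# List every data source registered for the project.
for source in store.list_data_sources():
    print(source.name)

# Fetch a single data source by name; raises an error if it does not exist.
driver_source = store.get_data_source("driver_stats_source")
```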
Sources: sdk/python/feast/feature_store.py462-477 sdk/python/feast/feature_store.py598-611 sdk/python/feast/errors.py155-160
- Use descriptive, unique names (e.g., `bq_customer_profiles`, `s3_driver_stats`)

Name Validation (sdk/python/feast/errors.py86-90):
- Set `created_timestamp_column` when source data may have duplicates at the same event time

For BigQuery:
- Set `billing_project_id` separately from `project_id` for cost management
- Configure `gcs_staging_location` for large result sets

For Redshift:
- Configure `s3_staging_location` for efficient data transfer
- Use an `iam_role` with minimal required permissions

For File Sources:
Sources: sdk/python/feast/infra/offline_stores/bigquery.py85-123 sdk/python/feast/infra/offline_stores/redshift.py47-92