Merged
@@ -1,9 +1,11 @@
# Batch Materialization Engine

Note: The materialization engine is not constructed via the unified compute engine interface.

A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.

A materialization engine abstracts over specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalMaterializationEngine), or delegates the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).
A materialization engine abstracts over specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalComputeEngine), or delegates the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaComputeEngine).

If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) for more details.
If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-compute-engine.md) for more details.
There are a couple of other references to creating-a-custom-materialization-engine.md that need to change as well:
https://github.com/search?q=repo%3Afeast-dev%2Ffeast%20creating-a-custom-materialization-engine.md&type=code

For sure, let me update in the next PR


Please see [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md#overview) for configuring engines.
@@ -1,24 +1,24 @@
# Adding a custom batch materialization engine
# Adding a custom compute engine

### Overview

Feast batch materialization operations (`materialize` and `materialize-incremental`) execute through a `BatchMaterializationEngine`.
Feast batch materialization operations (`materialize` and `materialize-incremental`) and historical retrieval (`get_historical_features`) are executed through a `ComputeEngine`.

Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include:
Custom compute engines allow Feast users to extend Feast to customize the materialization and `get_historical_features` processes. Examples include:

* Setting up custom materialization-specific infrastructure during `feast apply` (e.g. setting up Spark clusters or Lambda Functions)
* Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda)
* Tearing down custom materialization-specific infrastructure during `feast teardown` (e.g. tearing down Spark clusters, or deleting Lambda Functions)

Feast comes with built-in materialization engines, e.g, `LocalMaterializationEngine`, and an experimental `LambdaMaterializationEngine`. However, users can develop their own materialization engines by creating a class that implements the contract in the [BatchMaterializationEngine class](https://github.com/feast-dev/feast/blob/6d7b38a39024b7301c499c20cf4e7aef6137c47c/sdk/python/feast/infra/materialization/batch\_materialization\_engine.py#L72).
Feast comes with built-in compute engines, e.g., `LocalComputeEngine`, and an experimental `LambdaComputeEngine`. However, users can develop their own compute engines by creating a class that implements the contract in the [ComputeEngine class](https://github.com/feast-dev/feast/blob/85514edbb181df083e6a0d24672c00f0624dcaa3/sdk/python/feast/infra/compute_engines/base.py#L19).

### Guide

The fastest way to add custom logic to Feast is to extend an existing materialization engine. The most generic engine is the `LocalMaterializationEngine` which contains no cloud-specific logic. The guide that follows will extend the `LocalProvider` with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.
The fastest way to add custom logic to Feast is to implement the `ComputeEngine` interface. The guide that follows will build a custom engine with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.

#### Step 1: Define an Engine class

The first step is to define a custom materialization engine class. We've created the `MyCustomEngine` below. This python file can be placed in your `feature_repo` directory if you're following the Quickstart guide.
The first step is to define a custom compute engine class. We've created the `MyCustomEngine` below. This python file can be placed in your `feature_repo` directory if you're following the Quickstart guide.

```python
from typing import List, Sequence, Union
@@ -27,14 +27,16 @@ from feast.entity import Entity
from feast.feature_view import FeatureView
from feast.batch_feature_view import BatchFeatureView
from feast.stream_feature_view import StreamFeatureView
from feast.infra.materialization.local_engine import LocalMaterializationJob, LocalMaterializationEngine
from feast.infra.common.retrieval_task import HistoricalRetrievalTask
from feast.infra.compute_engines.local.job import LocalMaterializationJob
from feast.infra.compute_engines.base import ComputeEngine
from feast.infra.common.materialization_job import MaterializationTask
from feast.infra.offline_stores.offline_store import OfflineStore
from feast.infra.offline_stores.offline_store import OfflineStore, RetrievalJob
from feast.infra.online_stores.online_store import OnlineStore
from feast.repo_config import RepoConfig


class MyCustomEngine(LocalMaterializationEngine):
class MyCustomEngine(ComputeEngine):
def __init__(
self,
*,
@@ -80,9 +82,13 @@ class MyCustomEngine(LocalMaterializationEngine):
)
for task in tasks
]

def get_historical_features(self, task: HistoricalRetrievalTask) -> RetrievalJob:
raise NotImplementedError
```

Notice how in the above engine we have only overridden two of the methods on the `LocalMaterializationEngine`, namely `update` and `materialize`. These two methods are convenient to replace if you are planning to launch custom batch jobs.
Notice how in the above engine we have only overridden two of the methods of the `ComputeEngine`, namely `update` and `materialize`. These two methods are convenient to replace if you are planning to launch custom batch jobs.
If you want the compute engine to serve `get_historical_features`, you will need to implement that method as well.
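As a hedged, self-contained sketch of where that method plugs in (stand-in classes instead of the real Feast types; all names here are illustrative, not the actual implementation):

```python
from dataclasses import dataclass


@dataclass
class RetrievalTaskStub:
    """Stand-in for feast's HistoricalRetrievalTask."""
    feature_view_name: str


class EngineSketch:
    """Stand-in engine showing where get_historical_features plugs in."""

    def get_historical_features(self, task: RetrievalTaskStub) -> str:
        # A real engine would run a point-in-time-correct join here and
        # return a RetrievalJob (or a pyarrow Table); this stub just reports.
        return f"retrieved features for {task.feature_view_name}"


print(EngineSketch().get_historical_features(RetrievalTaskStub("driver_stats")))
# retrieved features for driver_stats
```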

#### Step 2: Configuring Feast to use the engine
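To point Feast at the engine, the `batch_engine` key in `feature_store.yaml` can name a built-in engine or the fully qualified path of a custom class. A minimal sketch — the `feature_repo.engine` module path is an assumption for illustration:

```yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
  type: sqlite
# Built-in engine name (e.g. "local"), or a fully qualified custom class path:
batch_engine: feature_repo.engine.MyCustomEngine
```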

2 changes: 1 addition & 1 deletion sdk/python/feast/batch_feature_view.py
@@ -79,7 +79,7 @@ def __init__(
ttl: Optional[timedelta] = None,
tags: Optional[Dict[str, str]] = None,
online: bool = False,
offline: bool = True,
offline: bool = False,
@HaoXuAI, Jun 2, 2025:
To be backward compatible for testing

description: str = "",
owner: str = "",
schema: Optional[List[Field]] = None,
3 changes: 2 additions & 1 deletion sdk/python/feast/infra/common/materialization_job.py
@@ -20,7 +20,8 @@ class MaterializationTask:
feature_view: Union[BatchFeatureView, StreamFeatureView, FeatureView]
start_time: datetime
end_time: datetime
tqdm_builder: Callable[[int], tqdm]
only_latest: bool = True
tqdm_builder: Union[None, Callable[[int], tqdm]] = None


class MaterializationJobStatus(enum.Enum):
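The diff above makes `tqdm_builder` optional and adds `only_latest`; a simplified stand-in dataclass (a stub mirroring only the changed fields, not the real class) shows the resulting construction:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class MaterializationTaskStub:
    """Stub mirroring the fields changed in the diff."""
    feature_view_name: str
    start_time: datetime
    end_time: datetime
    only_latest: bool = True                 # new field, defaults to True
    tqdm_builder: Optional[Callable] = None  # previously required, now optional


end = datetime(2025, 6, 2)
task = MaterializationTaskStub("driver_stats", end - timedelta(days=1), end)
print(task.only_latest, task.tqdm_builder)
# True None
```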
@@ -3,13 +3,12 @@
import logging
from concurrent.futures import ThreadPoolExecutor, wait
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Literal, Optional, Sequence, Union
from typing import Literal, Optional, Sequence, Union

import boto3
import pyarrow as pa
from botocore.config import Config
from pydantic import StrictStr
from tqdm import tqdm

from feast import utils
from feast.batch_feature_view import BatchFeatureView
@@ -21,9 +20,8 @@
MaterializationJobStatus,
MaterializationTask,
)
from feast.infra.materialization.batch_materialization_engine import (
BatchMaterializationEngine,
)
from feast.infra.common.retrieval_task import HistoricalRetrievalTask
from feast.infra.compute_engines.base import ComputeEngine
from feast.infra.offline_stores.offline_store import OfflineStore
from feast.infra.online_stores.online_store import OnlineStore
from feast.infra.registry.base_registry import BaseRegistry
@@ -40,8 +38,8 @@
logger = logging.getLogger(__name__)


class LambdaMaterializationEngineConfig(FeastConfigBaseModel):
"""Batch Materialization Engine config for lambda based engine"""
class LambdaComputeEngineConfig(FeastConfigBaseModel):
"""Compute engine config for the Lambda-based engine"""

type: Literal["lambda"] = "lambda"
""" Type selector"""
@@ -82,11 +80,18 @@ def url(self) -> Optional[str]:
return None


class LambdaMaterializationEngine(BatchMaterializationEngine):
class LambdaComputeEngine(ComputeEngine):
"""
WARNING: This engine should be considered "Alpha" functionality.
"""

def get_historical_features(
self, registry: BaseRegistry, task: HistoricalRetrievalTask
) -> pa.Table:
raise NotImplementedError(
"Lambda Compute Engine does not support get_historical_features"
)

def update(
self,
project: str,
@@ -160,30 +165,14 @@ def __init__(
config = Config(read_timeout=DEFAULT_TIMEOUT + 10)
self.lambda_client = boto3.client("lambda", config=config)

def materialize(
self, registry, tasks: List[MaterializationTask]
) -> List[MaterializationJob]:
return [
self._materialize_one(
registry,
task.feature_view,
task.start_time,
task.end_time,
task.project,
task.tqdm_builder,
)
for task in tasks
]

def _materialize_one(
self,
registry: BaseRegistry,
feature_view: Union[BatchFeatureView, StreamFeatureView, FeatureView],
start_date: datetime,
end_date: datetime,
project: str,
tqdm_builder: Callable[[int], tqdm],
self, registry: BaseRegistry, task: MaterializationTask, **kwargs
):
feature_view = task.feature_view
start_date = task.start_time
end_date = task.end_time
project = task.project

entities = []
for entity_name in feature_view.entities:
entities.append(registry.get_entity(entity_name, project))
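In the Lambda engine diff above, `_materialize_one` now receives the whole `MaterializationTask` instead of unpacked arguments, and the base class fans a single task out to a list. A self-contained sketch of that dispatch pattern with stub types (the stubs stand in for Feast's classes):

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TaskStub:
    """Stand-in for feast's MaterializationTask."""
    project: str


class EngineBaseSketch:
    def materialize(self, tasks: Union[TaskStub, List[TaskStub]]) -> List[str]:
        # Mirrors the new base-class behavior: normalize a single task
        # into a list, then delegate each one to _materialize_one.
        if isinstance(tasks, TaskStub):
            tasks = [tasks]
        return [self._materialize_one(t) for t in tasks]

    def _materialize_one(self, task: TaskStub) -> str:
        # Concrete engines (Lambda, Spark, ...) implement the actual job launch.
        return f"materialized {task.project}"


print(EngineBaseSketch().materialize(TaskStub("demo")))
# ['materialized demo']
```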
120 changes: 106 additions & 14 deletions sdk/python/feast/infra/compute_engines/base.py
@@ -1,63 +1,130 @@
from abc import ABC
from typing import Union
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence, Union

import pyarrow as pa

from feast import RepoConfig
from feast.batch_feature_view import BatchFeatureView
from feast.entity import Entity
from feast.feature_view import FeatureView
from feast.infra.common.materialization_job import (
MaterializationJob,
MaterializationTask,
)
from feast.infra.common.retrieval_task import HistoricalRetrievalTask
from feast.infra.compute_engines.dag.context import ColumnInfo, ExecutionContext
from feast.infra.offline_stores.offline_store import OfflineStore
from feast.infra.offline_stores.offline_store import OfflineStore, RetrievalJob
from feast.infra.online_stores.online_store import OnlineStore
from feast.infra.registry.registry import Registry
from feast.infra.registry.base_registry import BaseRegistry
from feast.on_demand_feature_view import OnDemandFeatureView
from feast.stream_feature_view import StreamFeatureView
from feast.utils import _get_column_names


class ComputeEngine(ABC):
"""
The interface that Feast uses to control the compute system that handles materialization and get_historical_features.
The interface that Feast uses to control the compute system that handles materialization and get_historical_features.
Each engine must implement:
- materialize(): to generate and persist features
- get_historical_features(): to perform point-in-time correct joins
- get_historical_features(): to perform historical retrieval of features
Engines should use FeatureBuilder and DAGNode abstractions to build modular, pluggable workflows.
"""

def __init__(
self,
*,
registry: Registry,
repo_config: RepoConfig,
offline_store: OfflineStore,
online_store: OnlineStore,
**kwargs,
):
self.registry = registry
self.repo_config = repo_config
self.offline_store = offline_store
self.online_store = online_store

def materialize(self, task: MaterializationTask) -> MaterializationJob:
raise NotImplementedError
@abstractmethod
def update(
self,
project: str,
views_to_delete: Sequence[
Union[BatchFeatureView, StreamFeatureView, FeatureView]
],
views_to_keep: Sequence[
Union[BatchFeatureView, StreamFeatureView, FeatureView, OnDemandFeatureView]
],
entities_to_delete: Sequence[Entity],
entities_to_keep: Sequence[Entity],
):
"""
Prepares cloud resources required for batch materialization for the specified set of Feast objects.

Args:
project: Feast project to which the objects belong.
views_to_delete: Feature views whose corresponding infrastructure should be deleted.
views_to_keep: Feature views whose corresponding infrastructure should not be deleted, and
may need to be updated.
entities_to_delete: Entities whose corresponding infrastructure should be deleted.
entities_to_keep: Entities whose corresponding infrastructure should not be deleted, and
may need to be updated.
"""
pass

@abstractmethod
def teardown_infra(
self,
project: str,
fvs: Sequence[Union[BatchFeatureView, StreamFeatureView, FeatureView]],
entities: Sequence[Entity],
):
"""
Tears down all cloud resources used by the materialization engine for the specified set of Feast objects.

Args:
project: Feast project to which the objects belong.
fvs: Feature views whose corresponding infrastructure should be deleted.
entities: Entities whose corresponding infrastructure should be deleted.
"""
pass

def get_historical_features(self, task: HistoricalRetrievalTask) -> pa.Table:
def materialize(
self,
registry: BaseRegistry,
tasks: Union[MaterializationTask, List[MaterializationTask]],
**kwargs,
) -> List[MaterializationJob]:
if isinstance(tasks, MaterializationTask):
tasks = [tasks]
return [self._materialize_one(registry, task, **kwargs) for task in tasks]

def _materialize_one(
self,
registry: BaseRegistry,
task: MaterializationTask,
**kwargs,
) -> MaterializationJob:
raise NotImplementedError(
"Materialization is not implemented for this compute engine."
)

def get_historical_features(
self, registry: BaseRegistry, task: HistoricalRetrievalTask
) -> Union[RetrievalJob, pa.Table]:
raise NotImplementedError

def get_execution_context(
self,
registry: BaseRegistry,
task: Union[MaterializationTask, HistoricalRetrievalTask],
) -> ExecutionContext:
entity_defs = [
self.registry.get_entity(name, task.project)
registry.get_entity(name, task.project)
for name in task.feature_view.entities
]
entity_df = None
if hasattr(task, "entity_df") and task.entity_df is not None:
entity_df = task.entity_df

column_info = self.get_column_info(task)
column_info = self.get_column_info(registry, task)
return ExecutionContext(
project=task.project,
repo_config=self.repo_config,
@@ -70,14 +137,39 @@ def get_execution_context(

def get_column_info(
self,
registry: BaseRegistry,
task: Union[MaterializationTask, HistoricalRetrievalTask],
) -> ColumnInfo:
entities = []
for entity_name in task.feature_view.entities:
entities.append(registry.get_entity(entity_name, task.project))

join_keys, feature_cols, ts_col, created_ts_col = _get_column_names(
task.feature_view, self.registry.list_entities(task.project)
task.feature_view, entities
)
field_mapping = self.get_field_mapping(task.feature_view)

return ColumnInfo(
join_keys=join_keys,
feature_cols=feature_cols,
ts_col=ts_col,
created_ts_col=created_ts_col,
field_mapping=field_mapping,
)

def get_field_mapping(
self, feature_view: Union[BatchFeatureView, StreamFeatureView, FeatureView]
) -> Optional[dict]:
"""
Get the field mapping for a feature view.
Args:
feature_view: The feature view to get the field mapping for.

Returns:
A dictionary mapping field names to column names.
"""
if feature_view.stream_source:
return feature_view.stream_source.field_mapping
if feature_view.batch_source:
return feature_view.batch_source.field_mapping
return None
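The `get_field_mapping` fallback added above (stream source takes precedence over batch source) can be exercised in isolation with stub sources — the stubs below are illustrative stand-ins, not the Feast classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceStub:
    field_mapping: dict


@dataclass
class FeatureViewStub:
    stream_source: Optional[SourceStub] = None
    batch_source: Optional[SourceStub] = None


def get_field_mapping(fv: FeatureViewStub) -> Optional[dict]:
    # Same precedence as the base class: stream source wins over batch source.
    if fv.stream_source:
        return fv.stream_source.field_mapping
    if fv.batch_source:
        return fv.batch_source.field_mapping
    return None


batch_only = FeatureViewStub(batch_source=SourceStub({"event_ts": "ts"}))
print(get_field_mapping(batch_only))
# {'event_ts': 'ts'}
```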