Commit 5c8b93c

updated docs
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
1 parent 9aceb7f commit 5c8b93c

8 files changed

Lines changed: 739 additions & 298 deletions

README.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -213,7 +213,7 @@ The list below contains the functionality that contributors are planning to deve
213213
* **Feature Engineering**
214214
* [x] On-demand Transformations (On Read) (Beta release. See [RFC](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit#))
215215
* [x] Streaming Transformations (Alpha release. See [RFC](https://docs.google.com/document/d/1UzEyETHUaGpn0ap4G82DHluiCj7zEbrQLkJJkKSv4e8/edit))
216-
* [ ] Batch transformation (In progress. See [RFC](https://docs.google.com/document/d/1964OkzuBljifDvkV-0fakp2uaijnVzdwWNGdz7Vz50A/edit))
216+
* [x] Batch transformation (Completed via unified transformation system. See [Feature Transformation](https://docs.feast.dev/getting-started/architecture/feature-transformation))
217217
* [x] On-demand Transformations (On Write) (Beta release. See [GitHub Issue](https://github.com/feast-dev/feast/issues/4376))
218218
* **Streaming**
219219
* [x] [Custom streaming ingestion job support](https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-provider)

docs/getting-started/architecture/feature-transformation.md

Lines changed: 336 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -1,29 +1,101 @@
11
# Feature Transformation
22

3-
A *feature transformation* is a function that takes some set of input data and
4-
returns some set of output data. Feature transformations can happen on either raw data or derived data.
3+
A *feature transformation* is a function that takes some set of input data and returns some set of output data. Feature transformations can happen on either raw data or derived data. Feast provides a unified transformation system that allows you to define transformations once and apply them across different execution contexts.
4+
5+
## Unified Transformation System
6+
7+
Feast's unified transformation system centers around the `@transformation` decorator, which provides a single, consistent API for defining feature transformations. This decorator supports multiple execution modes, timing controls, and automatic feature view creation.
8+
9+
### Key Benefits
10+
11+
- **Single API**: Define transformations once using the `@transformation` decorator
12+
- **Multiple Modes**: Support for Python, Pandas, SQL, Spark, Ray, and Substrait transformations
13+
- **Execution Timing Control**: Choose when transformations run (on read, on write, batch, streaming)
14+
- **Training-Serving Consistency**: Dual registration ensures the same transformation logic is used for training and serving
15+
- **Automatic Feature View Creation**: The enhanced decorator can create FeatureViews automatically when given additional parameters (`sources`, `schema`, `entities`, `name`)
16+
17+
## Transformation Execution
518

6-
## Feature Transformation Engines
719
Feature transformations can be executed by three types of "transformation engines":
820

9-
1. The Feast Feature Server
10-
2. An Offline Store (e.g., Snowflake, BigQuery, DuckDB, Spark, etc.)
11-
3. [A Compute Engine](../../reference/compute-engine/README.md)
21+
1. **The Feast Feature Server**: Executes transformations during online feature retrieval
22+
2. **An Offline Store**: Executes transformations during historical feature retrieval (e.g., Snowflake, BigQuery, DuckDB, Spark)
23+
3. **[A Compute Engine](../../reference/compute-engine/README.md)**: Executes transformations during batch processing or materialization
24+
25+
The choice of execution engine depends on the transformation timing (`when` parameter) and mode (`mode` parameter).
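
The list above pairs each engine with the context in which it runs. A minimal sketch of that pairing as a lookup table follows. The `ExecutionContext` enum and the `engine_for` helper are hypothetical names used only for illustration and are not part of Feast; how Feast itself resolves `when` and `mode` to an engine is not shown here.

```python
from enum import Enum


class ExecutionContext(Enum):
    """Hypothetical labels for the three contexts named in the list above."""
    ONLINE_RETRIEVAL = "online_retrieval"
    HISTORICAL_RETRIEVAL = "historical_retrieval"
    BATCH_OR_MATERIALIZATION = "batch_or_materialization"


# Context -> engine, taken directly from the list above.
ENGINE_BY_CONTEXT = {
    ExecutionContext.ONLINE_RETRIEVAL: "Feast Feature Server",
    ExecutionContext.HISTORICAL_RETRIEVAL: "Offline Store",
    ExecutionContext.BATCH_OR_MATERIALIZATION: "Compute Engine",
}


def engine_for(context: ExecutionContext) -> str:
    """Return the engine that the list above associates with a context."""
    return ENGINE_BY_CONTEXT[context]
```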
1226

13-
The three transformation engines are coupled with the [communication pattern used for writes](write-patterns.md).
27+
## The @transformation Decorator
1428

15-
Importantly, this implies that different feature transformation code may be
16-
used under different transformation engines, so understanding the tradeoffs of
17-
when to use which transformation engine/communication pattern is extremely critical to
18-
the success of your implementation.
29+
The `@transformation` decorator is the primary API for defining feature transformations in Feast. It provides both backward compatibility with existing transformation patterns and new enhanced capabilities.
1930

20-
In general, we recommend transformation engines and network calls to be chosen by aligning it with what is most
21-
appropriate for the data producer, feature/model usage, and overall product.
31+
### Basic Usage (Backward Compatible)
32+
33+
```python
34+
from feast.transformation import transformation, TransformationMode
35+
36+
@transformation(mode=TransformationMode.PANDAS)
37+
def remove_extra_spaces(df: pd.DataFrame) -> pd.DataFrame:
38+
"""Remove extra spaces from name column."""
39+
return df.assign(name=df['name'].str.replace(r'\s+', ' ', regex=True))
40+
41+
# Use in a FeatureView
42+
feature_view = FeatureView(
43+
name="processed_drivers",
44+
entities=[driver_entity],
45+
source=driver_source,
46+
feature_transformation=remove_extra_spaces,
47+
...
48+
)
49+
```
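
To see what the regular expression in `remove_extra_spaces` does to a single value, here is a sketch that uses only Python's standard `re` module (no pandas or Feast required). The `collapse_whitespace` helper is a hypothetical name for illustration.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


print(collapse_whitespace("John    Smith"))      # John Smith
print(repr(collapse_whitespace("  Ann\tLee ")))  # ' Ann Lee '
```

Note that leading and trailing whitespace is collapsed to one space rather than removed; chaining `.str.strip()` in the pandas version would trim it if that is the intent.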
2250

51+
### Enhanced Usage (New Capabilities)
2352

24-
## API
25-
### feature_transformation
26-
`feature_transformation` or `udf` are the core APIs for defining feature transformations in Feast. They allow you to specify custom logic that can be applied to the data during materialization or retrieval. Examples include:
53+
The decorator supports additional parameters that enable automatic FeatureView creation and advanced execution control:
54+
55+
```python
56+
from feast.transformation import transformation, TransformationTiming
57+
58+
@transformation(
59+
mode="pandas",
60+
when="on_read", # Execute during feature retrieval
61+
online=True, # Enable dual registration for training-serving consistency
62+
sources=[driver_hourly_stats_view],
63+
schema=[
64+
Field(name="conv_rate_adjusted", dtype=Float64),
65+
Field(name="efficiency_score", dtype=Float64)
66+
],
67+
entities=[driver_entity],
68+
name="driver_metrics_enhanced",
69+
description="Enhanced driver metrics with efficiency scoring"
70+
)
71+
def enhance_driver_metrics(df: pd.DataFrame) -> pd.DataFrame:
72+
"""Enhance driver metrics with additional calculations."""
73+
result = pd.DataFrame()
74+
result["conv_rate_adjusted"] = df["conv_rate"] * 1.1
75+
result["efficiency_score"] = df["conv_rate"] * df["acc_rate"] / df["avg_daily_trips"]
76+
return result
77+
78+
# This automatically creates:
79+
# 1. A FeatureView for batch/training use
80+
# 2. An OnDemandFeatureView for online serving (when online=True)
81+
```
82+
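
The comments at the end of the example describe dual registration: one function backing two feature views. The following is a toy sketch of that idea in plain Python, not Feast's implementation. The `REGISTRY` dict, the `register` decorator, and the `batch:`/`online:` key prefixes are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical in-memory registry: view name -> transformation function.
REGISTRY: Dict[str, Callable] = {}


def register(name: str, online: bool = False) -> Callable[[Callable], Callable]:
    """Toy decorator: always register a batch view, plus an online view if requested."""
    def wrap(func: Callable) -> Callable:
        REGISTRY[f"batch:{name}"] = func
        if online:
            # Same function object, so both paths share one definition.
            REGISTRY[f"online:{name}"] = func
        return func
    return wrap


@register(name="driver_metrics_enhanced", online=True)
def adjust_conv_rate(conv_rate: float) -> float:
    """Stand-in transformation: scale the conversion rate by 1.1."""
    return conv_rate * 1.1
```

Because both registry entries refer to the same function object, a change to the function body is picked up by the batch and online paths at once.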
83+
### Parameters
84+
85+
The `@transformation` decorator supports several key parameters:
86+
87+
- **`mode`**: Transformation execution mode (`pandas`, `python`, `sql`, `spark`, `ray`, `substrait`)
88+
- **`when`**: Execution timing (`on_read`, `on_write`, `batch`, `streaming`)
89+
- **`online`**: Enable dual registration for training-serving consistency
90+
- **`sources`**: Source FeatureViews for automatic feature view creation
91+
- **`schema`**: Output schema when auto-creating feature views
92+
- **`entities`**: Entities for auto-created feature views
93+
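
The documented values for `mode` and `when` can be checked up front. The sketch below is a hypothetical validator built from the parameter list above; the function name `validate_params` and its error messages are assumptions, and Feast may validate these arguments differently.

```python
# Allowed values copied from the parameter list above.
VALID_MODES = {"pandas", "python", "sql", "spark", "ray", "substrait"}
VALID_TIMINGS = {"on_read", "on_write", "batch", "streaming"}


def validate_params(mode: str, when: str) -> None:
    """Raise ValueError if mode or when is not one of the documented values."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    if when not in VALID_TIMINGS:
        raise ValueError(f"unknown when {when!r}; expected one of {sorted(VALID_TIMINGS)}")


validate_params("pandas", "on_read")  # passes silently
```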
94+
## Legacy API (Still Supported)
95+
96+
The existing transformation APIs continue to work alongside the new unified system:
97+
98+
### Using Transformation Objects
2799

28100
```python
29101
def remove_extra_spaces(df: DataFrame) -> DataFrame:
@@ -40,7 +112,9 @@ feature_view = FeatureView(
40112
...
41113
)
42114
```
43-
OR
115+
116+
### Using Generic Transformation Class
117+
44118
```python
45119
spark_transformation = Transformation(
46120
mode=TransformationMode.SPARK_SQL,
@@ -52,7 +126,9 @@ feature_view = FeatureView(
52126
...
53127
)
54128
```
55-
OR
129+
130+
### Basic Decorator Usage
131+
56132
```python
57133
@transformation(mode=TransformationMode.SPARK)
58134
def remove_extra_spaces_udf(df: pd.DataFrame) -> pd.DataFrame:
@@ -64,6 +140,248 @@ feature_view = FeatureView(
64140
)
65141
```
66142

143+
## Migration Examples: Old vs New Patterns
144+
145+
### Example 1: Stream Feature View Transformations
146+
147+
**Old Way - Stream Feature View with Transformation**
148+
149+
```python
150+
from feast import StreamFeatureView, Entity, Field, FileSource
151+
from feast.data_source import KafkaSource
152+
from feast.types import Float64, Int64, String
153+
from feast.transformation.pandas_transformation import PandasTransformation
154+
155+
# Define entities and sources
156+
driver_entity = Entity(name="driver", join_keys=["driver_id"])
157+
158+
kafka_source = KafkaSource(
159+
name="driver_events",
160+
kafka_bootstrap_servers="localhost:9092",
161+
topic="driver_events",
162+
timestamp_field="event_timestamp",
163+
batch_source=FileSource(path="driver_events.parquet")
164+
)
165+
166+
# Define transformation function
167+
def calculate_driver_score(df: pd.DataFrame) -> pd.DataFrame:
168+
"""Calculate driver performance score."""
169+
df["driver_score"] = df["conv_rate"] * df["acc_rate"] * 100
170+
df["performance_tier"] = pd.cut(
171+
df["driver_score"],
172+
bins=[0, 30, 70, 100],
173+
labels=["low", "medium", "high"]
174+
)
175+
return df
176+
177+
# Create transformation object
178+
driver_transformation = PandasTransformation(
179+
udf=calculate_driver_score,
180+
udf_string="calculate driver score"
181+
)
182+
183+
# Create Stream Feature View
184+
driver_stream_fv = StreamFeatureView(
185+
name="driver_stream_features",
186+
entities=[driver_entity],
187+
schema=[
188+
Field(name="conv_rate", dtype=Float64),
189+
Field(name="acc_rate", dtype=Float64),
190+
Field(name="driver_score", dtype=Float64),
191+
Field(name="performance_tier", dtype=String),
192+
],
193+
source=kafka_source,
194+
feature_transformation=driver_transformation,
195+
)
196+
```
197+
198+
**New Way - Unified Transformation with Streaming**
199+
200+
```python
201+
from feast.transformation import transformation
202+
203+
# Define the same transformation with unified decorator
204+
@transformation(
205+
mode="pandas",
206+
when="streaming", # Execute in streaming context
207+
online=True, # Enable dual registration
208+
sources=[kafka_source],
209+
schema=[
210+
Field(name="driver_score", dtype=Float64),
211+
Field(name="performance_tier", dtype=String),
212+
],
213+
entities=[driver_entity],
214+
name="driver_stream_features",
215+
description="Real-time driver performance scoring"
216+
)
217+
def calculate_driver_score_unified(df: pd.DataFrame) -> pd.DataFrame:
218+
"""Calculate driver performance score - unified approach."""
219+
result = pd.DataFrame()
220+
result["driver_score"] = df["conv_rate"] * df["acc_rate"] * 100
221+
result["performance_tier"] = pd.cut(
222+
result["driver_score"],
223+
bins=[0, 30, 70, 100],
224+
labels=["low", "medium", "high"]
225+
)
226+
return result
227+
228+
# Automatically creates both StreamFeatureView and OnDemandFeatureView
229+
```
230+
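
With its default settings, `pd.cut` with `bins=[0, 30, 70, 100]` builds right-closed intervals `(0, 30]`, `(30, 70]`, `(70, 100]`, so a score of exactly 0 or anything above 100 falls outside every bin and comes back as NaN (passing `include_lowest=True` brings 0 into the first bin). The sketch below reproduces that default binning with the standard-library `bisect` module so the edge cases are easy to see. `tier_for` is a hypothetical helper, not part of pandas or Feast.

```python
from bisect import bisect_left
from typing import Optional

EDGES = [0, 30, 70, 100]
LABELS = ["low", "medium", "high"]


def tier_for(score: float) -> Optional[str]:
    """Mimic pd.cut's default right-closed bins; None stands in for NaN."""
    idx = bisect_left(EDGES, score)
    if idx == 0 or idx == len(EDGES):
        return None  # score <= 0 or score > 100: outside all bins
    return LABELS[idx - 1]


print(tier_for(30))    # low
print(tier_for(30.5))  # medium
print(tier_for(100))   # high
print(tier_for(0))     # None
```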
231+
### Example 2: On Demand Feature View Transformations
232+
233+
**Old Way - Separate ODFV Definition**
234+
235+
```python
236+
from feast.on_demand_feature_view import on_demand_feature_view
237+
from feast import RequestSource, Field
238+
from feast.types import Float64, Int64
239+
240+
# Define request source for real-time data
241+
request_source = RequestSource(
242+
name="driver_request",
243+
schema=[
244+
Field(name="current_temp", dtype=Float64),
245+
Field(name="time_of_day", dtype=Int64),
246+
]
247+
)
248+
249+
# Define ODFV with transformation
250+
@on_demand_feature_view(
251+
sources=[driver_hourly_stats_view, request_source],
252+
schema=[
253+
Field(name="weather_adjusted_score", dtype=Float64),
254+
Field(name="time_adjusted_conv_rate", dtype=Float64),
255+
],
256+
mode="pandas",
257+
write_to_online_store=True # Apply on write
258+
)
259+
def weather_adjusted_features(features_df: pd.DataFrame) -> pd.DataFrame:
260+
"""Adjust features based on weather and time."""
261+
df = pd.DataFrame()
262+
263+
# Weather adjustment
264+
weather_factor = 1.0 + (features_df["current_temp"] - 70) / 100
265+
df["weather_adjusted_score"] = features_df["conv_rate"] * weather_factor
266+
267+
# Time of day adjustment
268+
time_factor = np.where(
269+
(features_df["time_of_day"] >= 6) & (features_df["time_of_day"] <= 18),
270+
1.1, # Daytime boost
271+
0.9 # Nighttime reduction
272+
)
273+
df["time_adjusted_conv_rate"] = features_df["conv_rate"] * time_factor
274+
275+
return df
276+
```
277+
278+
**New Way - Unified Transformation**
279+
280+
```python
281+
from feast.transformation import transformation
282+
283+
@transformation(
284+
mode="pandas",
285+
when="on_write", # Apply during data ingestion
286+
online=True, # Enable dual registration
287+
sources=[driver_hourly_stats_view, request_source],
288+
schema=[
289+
Field(name="weather_adjusted_score", dtype=Float64),
290+
Field(name="time_adjusted_conv_rate", dtype=Float64),
291+
],
292+
entities=[driver_entity],
293+
name="contextual_driver_features",
294+
description="Driver features adjusted for weather and time context"
295+
)
296+
def weather_adjusted_features_unified(df: pd.DataFrame) -> pd.DataFrame:
297+
"""Adjust features based on weather and time - unified approach."""
298+
result = pd.DataFrame()
299+
300+
# Weather adjustment
301+
weather_factor = 1.0 + (df["current_temp"] - 70) / 100
302+
result["weather_adjusted_score"] = df["conv_rate"] * weather_factor
303+
304+
# Time of day adjustment
305+
import numpy as np
306+
time_factor = np.where(
307+
(df["time_of_day"] >= 6) & (df["time_of_day"] <= 18),
308+
1.1, # Daytime boost
309+
0.9 # Nighttime reduction
310+
)
311+
result["time_adjusted_conv_rate"] = df["conv_rate"] * time_factor
312+
313+
return result
314+
315+
# This creates:
316+
# 1. A FeatureView for batch processing
317+
# 2. An OnDemandFeatureView for online serving
318+
# Both use the same transformation logic!
319+
```
320+
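
The two adjustments in these examples are simple enough to check by hand. Here is a scalar sketch of the same arithmetic in plain Python; the helper names `weather_factor` and `time_factor` are hypothetical, and the input values are made up for illustration.

```python
def weather_factor(current_temp: float) -> float:
    """1.0 at a temperature of 70, shifting by 0.01 per degree away from 70."""
    return 1.0 + (current_temp - 70) / 100


def time_factor(time_of_day: int) -> float:
    """1.1 between hours 6 and 18 inclusive, 0.9 otherwise."""
    return 1.1 if 6 <= time_of_day <= 18 else 0.9


conv_rate = 0.5
print(round(conv_rate * weather_factor(80), 4))  # 0.55
print(round(conv_rate * time_factor(22), 4))     # 0.45
```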
321+
### Example 3: Training-Serving Consistency
322+
323+
**Old Way - Duplicate Logic**
324+
325+
```python
326+
# Training pipeline transformation
327+
def training_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
328+
"""Feature engineering for training."""
329+
df["interaction_score"] = df["conv_rate"] * df["acc_rate"]
330+
df["normalized_trips"] = df["avg_daily_trips"] / df["avg_daily_trips"].max()
331+
return df
332+
333+
# Separate serving transformation (risk of skew!)
334+
@on_demand_feature_view(
335+
sources=[driver_stats_view],
336+
schema=[
337+
Field(name="interaction_score", dtype=Float64),
338+
Field(name="normalized_trips", dtype=Float64),
339+
]
340+
)
341+
def serving_feature_engineering(features_df: pd.DataFrame) -> pd.DataFrame:
342+
"""Feature engineering for serving - DUPLICATE LOGIC!"""
343+
df = pd.DataFrame()
344+
df["interaction_score"] = features_df["conv_rate"] * features_df["acc_rate"]
345+
df["normalized_trips"] = features_df["avg_daily_trips"] / 100 # Hardcoded max!
346+
return df
347+
```
348+
349+
**New Way - Single Source of Truth**
350+
351+
```python
352+
@transformation(
353+
mode="pandas",
354+
when="on_read", # Fresh calculations
355+
online=True, # Dual registration ensures consistency
356+
sources=[driver_stats_view],
357+
schema=[
358+
Field(name="interaction_score", dtype=Float64),
359+
Field(name="normalized_trips", dtype=Float64),
360+
],
361+
entities=[driver_entity],
362+
name="consistent_driver_features"
363+
)
364+
def unified_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
365+
"""Single transformation for both training and serving."""
366+
result = pd.DataFrame()
367+
result["interaction_score"] = df["conv_rate"] * df["acc_rate"]
368+
result["normalized_trips"] = df["avg_daily_trips"] / df["avg_daily_trips"].max()
369+
return result
370+
371+
# Same logic used for:
372+
# - Historical feature retrieval (training)
373+
# - Online feature serving (inference)
374+
# - Batch materialization
375+
```
376+
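
One detail worth checking in this example: `normalized_trips` divides by the maximum of whatever rows the function receives, so sharing the code does not by itself make the statistic identical across contexts. The plain-Python sketch below illustrates this, assuming the online path passes a single row per request; `normalize_by_batch_max` is a hypothetical helper.

```python
from typing import List


def normalize_by_batch_max(trips: List[float]) -> List[float]:
    """Divide each value by the maximum of the batch it arrived in."""
    batch_max = max(trips)
    return [t / batch_max for t in trips]


# Training: many rows in one batch.
print(normalize_by_batch_max([10, 50, 100]))  # [0.1, 0.5, 1.0]

# Online serving: a single row is its own maximum.
print(normalize_by_batch_max([50]))           # [1.0]
```

Dividing by a fixed, precomputed maximum instead of a per-batch one would keep the value of 50 trips at 0.5 in both cases.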
377+
### Benefits of the New Approach
378+
379+
1. **Reduced Code Duplication**: Single transformation definition vs multiple implementations
380+
2. **Training-Serving Consistency**: Automatic dual registration runs the same transformation code for training and serving, removing skew caused by duplicated logic
381+
3. **Simplified Management**: One decorator handles all transformation contexts
382+
4. **Better Maintainability**: Changes only need to be made in one place
383+
5. **Flexible Execution**: Easy to change timing (`when` parameter) without rewriting logic
384+
67385
### Aggregation
68386
Aggregation is a built-in API for defining batch or streaming aggregations on data. It lets you specify how to aggregate data over a time window, such as calculating the average or sum of a feature over a specified period. Examples include:
69387
```python
