Description
Note: Users can push streaming features to the online store via push sources.
Note: RFC can be found here.
See also: Feast Slack: #feast-feature-transformations
Requirements
Problem 1
Data scientists cannot meaningfully author event-backed features that will be fresh at serving time.
Today, data scientists typically define a feature and then work with engineers to build a streaming pipeline that pushes it to the online store. This tight coupling creates an engineering bottleneck on the path to production.
Problem 2
Users cannot guarantee consistent feature transformations across batch + streaming contexts. If they author a SQL transformation in batch, and a PySpark transformation for streaming, they may have inconsistencies.
- This problem is also related to versioning transformation logic within Feast. If both batch and stream transformation logic live in Feast, it is easier to keep them consistent. An ML platform team could also build a custom feature view that requires only one transformation to be defined but maps it to different semantics in batch vs streaming.
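One way to avoid batch/stream drift is to author the transformation once as a plain function and apply the same function in both execution paths. A minimal sketch, with a hypothetical `enrich_trip` transform standing in for real feature logic (not part of the Feast API):

```python
# Hypothetical sketch: one transformation definition shared by both the
# batch and streaming paths, so the two cannot drift apart.
def enrich_trip(row: dict) -> dict:
    # Row-level transform: derive speed in km/h from distance (km)
    # and duration (seconds).
    row["speed_kmh"] = row["distance"] / (row["duration"] / 3600)
    return row

# Batch path: apply over historical rows.
historical = [{"distance": 10.0, "duration": 1800}]
batch_out = [enrich_trip(dict(r)) for r in historical]

# Streaming path: the identical function is applied per incoming event.
event = {"distance": 5.0, "duration": 900}
stream_out = enrich_trip(dict(event))
```

Because both paths call the same function, a change to the logic is automatically versioned once and applied consistently.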
Problem 3
Users cannot version transformation logic and tie it to downstream models (a lineage problem). It's useful to see what models may be impacted by changing some stream transformation / aggregation logic, and perhaps allow for A/B experiments using different transformations.
Potential solution
Types of transformations
We should support defining both:
- stateless transformations (e.g. row level transforms / mapping)
- stateful transformations (e.g. windowed transformations)
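The distinction between the two can be sketched in plain Python; the names here (`to_miles`, `WindowedMean`) are illustrative, not part of any proposed API:

```python
from collections import deque
from datetime import datetime, timedelta

# Stateless: each output row depends only on the current input row.
def to_miles(row: dict) -> dict:
    return {**row, "distance_mi": row["distance_km"] * 0.621371}

# Stateful: each output depends on a sliding window of prior events.
class WindowedMean:
    def __init__(self, window: timedelta = timedelta(days=7)):
        self.window = window
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def update(self, ts: datetime, value: float) -> float:
        self.events.append((ts, value))
        # Evict events that fell outside the window.
        while self.events and ts - self.events[0][0] > self.window:
            self.events.popleft()
        return sum(v for _, v in self.events) / len(self.events)
```

Stateless transforms can run anywhere with no coordination; stateful ones require the engine to keep (and checkpoint) window state, which is why they usually need first-class support such as the `Aggregation` construct below.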
Data scientist developer experience
It should be easy for a data scientist to correctly author stream transformations (including aggregations).
- One way is to introduce a higher level DSL for aggregation logic.
- Another way is to allow streaming some sample data in, applying the transformation, and seeing the input vs output.
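The second approach could look something like the following sketch: run a handful of sample events through the transformation locally and compare input to output before deploying anything. `transform` is a hypothetical stand-in for the user's logic:

```python
# Hypothetical dry-run workflow: apply the transformation to sample
# events and inspect input vs output side by side.
def transform(row: dict) -> dict:
    return {
        "driver_id": row["driver_id"],
        "most_recent_trip_distance": row["distance"],
    }

samples = [
    {"driver_id": 1, "distance": 12.3},
    {"driver_id": 2, "distance": 4.5},
]

dry_run_outputs = [transform(r) for r in samples]
for inp, out in zip(samples, dry_run_outputs):
    print(f"in={inp} -> out={out}")
```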
Example feature view
@stream_feature_view(
    name="driver_stats",
    entities=[driver],
    mode="spark_sql",
    aggregations=[
        Aggregation(
            column="distance",
            function="mean",
            time_windows=[timedelta(days=7)],
        )
    ],
    aggregation_interval=timedelta(days=1),
    source=driver_stats_stream_source,
)
def driver_stats(driver_stats_stream_source):
    return f'''
    SELECT
        driver_id,
        distance AS most_recent_trip_distance,
        duration AS most_recent_trip_duration,
        event_timestamp
    FROM
        {driver_stats_stream_source}
    '''
Alternatives considered
- Kicking off long-running jobs that ingest events, transform them (Spark, Flink, Bytewax, etc.), and write to the online store.