
Stream transformations #2597

@adchia

Description


Note: Users can push streaming features to the online store via push sources
Note: RFC can be found here.
See also: Feast Slack: #feast-feature-transformations

Requirements

Problem 1

Data scientists cannot meaningfully author event-backed features that will be fresh at serving time.

Today, a data scientist typically defines a feature and then works with engineers to build a streaming pipeline that pushes it to the online store. This tight coupling creates an engineering bottleneck on the path to production.

Problem 2

Users cannot guarantee consistent feature transformations across batch + streaming contexts. If they author a SQL transformation in batch, and a PySpark transformation for streaming, they may have inconsistencies.

  • This problem also relates to versioning transformation logic within Feast. If both the batch and stream transformation logic live in Feast, it is easier to keep them consistent. An ML platform team can also build a custom feature view that requires only one transformation to be defined but maps it to the appropriate semantics in batch vs streaming.
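One hypothetical way to get that guarantee is to define the transformation once as a plain function and reuse it in both contexts. A minimal Python sketch (all names here are illustrative, not Feast APIs):

```python
def enrich_trip(row: dict) -> dict:
    # Single source of truth for the transformation logic.
    return {**row, "distance_km": row["distance_m"] / 1000}

def transform_batch(rows):
    # Batch context: apply to a materialized list of rows.
    return [enrich_trip(r) for r in rows]

def transform_stream(events):
    # Streaming context: apply lazily, one event at a time.
    for event in events:
        yield enrich_trip(event)
```

Because both paths call the same `enrich_trip`, the batch and stream outputs cannot drift apart.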

Problem 3

Users cannot version transformation logic and tie it to downstream models (a lineage problem). It's useful to see what models may be impacted by changing some stream transformation / aggregation logic, and perhaps allow for A/B experiments using different transformations.

Potential solution

Types of transformations

We should support defining both:

  • stateless transformations (e.g. row level transforms / mapping)
  • stateful transformations (e.g. windowed transformations)

Data scientist developer experience

It should be easy for a data scientist to correctly author stream transformations (including aggregations).

  • One way is to introduce a higher-level DSL for aggregation logic.
  • Another way is to allow streaming some sample data in, applying the transformation, and seeing the input vs output.
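The second idea could be as simple as a helper that runs the candidate transformation over a handful of sample events and returns input/output pairs for inspection (a hypothetical helper, not an existing Feast API):

```python
def preview_transform(transform, sample_events):
    # Apply the candidate transformation to sample data and pair
    # each input with its output so the author can eyeball the result.
    return [(event, transform(event)) for event in sample_events]
```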

Example feature view

@stream_feature_view(
    name="driver_stats",
    entities=[driver],
    mode="spark_sql",
    aggregations=[
        Aggregation(
            column="distance",
            function="mean",
            time_windows=[timedelta(days=7)],
        )
    ],
    aggregation_interval=timedelta(days=1),
    source=driver_stats_stream_source,
)
def driver_stats(driver_stats_stream_source):
    return f'''
        SELECT
            driver_id,
            distance AS most_recent_trip_distance,
            duration AS most_recent_trip_duration,
            event_timestamp
        FROM
            {driver_stats_stream_source}
        '''
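The `Aggregation` above asks for a 7-day mean of `distance`, refreshed on a 1-day interval. In plain Python, the intended windowing semantics might look like this (a sketch of the semantics only, not how Feast would execute it):

```python
from datetime import datetime, timedelta

def seven_day_mean(events, as_of: datetime, window=timedelta(days=7)):
    # Mean of `distance` over events whose timestamp falls inside
    # the window ending at `as_of`; None if the window is empty.
    in_window = [e["distance"] for e in events
                 if as_of - window < e["event_timestamp"] <= as_of]
    return sum(in_window) / len(in_window) if in_window else None
```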

Alternatives considered

  • Kicking off long-running jobs that ingest events, transform them (Spark, Flink, Bytewax, etc.), and write to the online store.
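This alternative amounts to a continuously running consume-transform-write loop. A toy, framework-free sketch of that shape (the real job would run on Spark/Flink/Bytewax, and `online_store` here is just a dict standing in for Redis or similar):

```python
def run_ingestion_loop(events, transform, online_store: dict):
    # Long-running job shape: consume each event, transform it,
    # and upsert the result into the online store keyed by entity.
    for event in events:
        feature_row = transform(event)
        online_store[feature_row["driver_id"]] = feature_row
```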

Metadata


Labels

  • kind/feature: New feature or request
  • kind/project: A top level project to be tracked in GitHub Projects
  • priority/p0: Highest priority
  • wontfix: This will not be worked on

Status

Done