Unified Push API to offline and online stores #2732

@adchia

Problem

It's difficult to keep streaming features (e.g. from Kafka + Spark Streaming) or push features (e.g. computed at request time) consistently available at both training and serving time.

With streaming features today, users would need to either:

  1. write transformed features to both offline and online stores:
    • stream 1 -> transform -> stream 2
    • stream 2 -> write to offline store
    • stream 2 -> write to online store via feature_store.push(push_source_name, df)
  2. use both batch + stream transformations:
    • stream 1 -> offline store (raw events)
    • stream 1 -> transform -> stream 2 -> online store via feature_store.push(push_source_name, df)

The problem compounds as data scientists iterate on transformations while training their models, because engineers must then re-implement each change for model serving.

Potential solution

A simple solution may be to allow:

  • FeatureView to have an offline=True option
  • feature_store.push to also append to an existing table in the offline store (e.g. data warehouse) that matches the feature view name.
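To make the proposed behavior concrete, here is a minimal in-memory sketch of a push that fans out to both stores. Everything in it is hypothetical: `ToyFeatureStore`, `register_view`, and the `driver_id` / `event_timestamp` keys are illustrative names, not Feast APIs, and a real implementation would write to the configured online store and append to a warehouse table instead of Python dicts and lists.

```python
from collections import defaultdict
from datetime import datetime, timezone


class ToyFeatureStore:
    """Hypothetical in-memory stand-in; this is not Feast's FeatureStore."""

    def __init__(self):
        # feature view name -> {entity key -> latest feature row}
        self.online = defaultdict(dict)
        # feature view name -> append-only history of feature rows
        self.offline = defaultdict(list)
        # push source name -> [(feature view name, offline flag)]
        self.views_by_source = defaultdict(list)

    def register_view(self, name, push_source, offline=False):
        # Mirrors the proposed FeatureView(offline=True) option.
        self.views_by_source[push_source].append((name, offline))

    def push(self, push_source, rows,
             entity_key="driver_id", ts_key="event_timestamp"):
        for view_name, offline in self.views_by_source[push_source]:
            for row in rows:
                key = row[entity_key]
                current = self.online[view_name].get(key)
                # Online store keeps only the newest row per entity.
                if current is None or row[ts_key] >= current[ts_key]:
                    self.online[view_name][key] = row
                # Offline store keeps the full history for training.
                if offline:
                    self.offline[view_name].append(row)


store = ToyFeatureStore()
store.register_view(
    "driver_daily_features", "driver_stats_push_source", offline=True
)
store.push(
    "driver_stats_push_source",
    [
        {"driver_id": 1001,
         "event_timestamp": datetime(2022, 5, 1, tzinfo=timezone.utc),
         "daily_miles_driven": 42.0},
        {"driver_id": 1001,
         "event_timestamp": datetime(2022, 5, 2, tzinfo=timezone.utc),
         "daily_miles_driven": 57.5},
    ],
)
```

With a single push call, the online side holds only the latest value per driver while the offline side retains every row, so training and serving read the same transformed features.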

Alternatives

  • Pushing to the original data source. This works today in Feast if there are no transformations, but feature views will soon have transformations (e.g. Batch transformations #2730 or Stream transformations #2597), at which point it would be inconsistent: feature_store.push should push transformed features, not raw data.
  • Feast ingestion (e.g. submitting jobs to Spark) from a topic to both offline + online store sinks

Appendix

Background

Currently, Feast supports pushing features to the online store (https://docs.feast.dev/reference/data-sources/push).

An example may be:

  • Definition of features:
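    from datetime import timedelta

    from feast import FeatureView, Field, PushSource
    from feast.types import Float32

    # driver_stats is the batch data source, assumed to be defined elsewhere.
    driver_stats_push_source = PushSource(
        name="driver_stats_push_source", batch_source=driver_stats,
    )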
    
    driver_daily_features_view = FeatureView(
        name="driver_daily_features",
        entities=["driver"],
        ttl=timedelta(seconds=8640000000),
        schema=[Field(name="daily_miles_driven", dtype=Float32),],
        online=True,
        source=driver_stats_push_source,
        tags={"production": "True"},
        owner="test2@gmail.com",
    )
  • Pushing features to the online store (e.g. from Spark):
    store.push("driver_stats_push_source", pandas_df)
