
Batch transformations #2730

@adchia

See also: Feast RFC-028: Batch Transformations
See also: Feast Slack: #feast-feature-transformations

Overview

Today, users can use Feast to track the final view of batch features. At retrieval time, users need to specify the entity join_keys for each feature view. For example:

[Image: batch_transform]

Problem 1: Difficult to iterate on transformation logic and avoid duplication, especially when there are downstream models

The above flow requires that users maintain transformation logic outside of Feast. To iterate on that logic or adapt it to different use cases, they need to find the relevant pipeline logic outside of Feast.

If a data scientist wants to help author the necessary features for a specific model, today the data scientist needs to:

  1. Set up transformation logic that outputs the view above
  2. Register the view as a data source in Feast (+ feature view)
  3. Register a feature service referencing the feature view

That first step can be problematic for several reasons:

  • Data scientists are responsible for creating views of data without visibility into what other views already exist and how they are produced.
  • If the transformation logic changes, there's no easy way to know whether it impacts downstream models that depend on the output features.

Note: Feast does not intend to be a full solution for describing complex sequences of transformation DAGs in the near future. This is for simple transformations that can address the common 80% of use cases.

Problem 2: Difficult to retrieve features that require a side lookup

This is best explained by example: if you want to look up user features (including features from their country of origin), but you have only the user_id at request time, then at training data generation (or materialization) time, you'd ideally want to do a left join onto the user features, appending the country features.
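
To make the side lookup concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of a real offline store. The table names, columns, and sample rows are assumptions made up for illustration; only the left-join shape comes from the description above.

```python
import sqlite3

# Hypothetical in-memory tables standing in for the user and country sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_metadata (user_id INTEGER, country_id INTEGER);
    CREATE TABLE country_metadata (id INTEGER, category INTEGER);
    INSERT INTO user_metadata VALUES (1, 10), (2, 20), (3, 99);
    INSERT INTO country_metadata VALUES (10, 7), (20, 3);
""")


def lookup_user_with_country(user_ids):
    """Return one row per requested user_id, with country features appended."""
    placeholders = ",".join("?" for _ in user_ids)
    query = f"""
        SELECT u.user_id, u.country_id, c.category AS country_category
        FROM user_metadata AS u
        LEFT JOIN country_metadata AS c ON u.country_id = c.id
        WHERE u.user_id IN ({placeholders})
        ORDER BY u.user_id
    """
    return conn.execute(query, list(user_ids)).fetchall()


# Only user_id is known by the caller; country features arrive via the join.
rows = lookup_user_with_country([1, 3])
```

Because the join is a LEFT JOIN, user 3 (whose country has no metadata row) is still returned, with a NULL country_category, rather than being dropped.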

Potential Solution

There are two steps:

  1. Enable batch transformations to be defined in Feast
  2. Wrap up "Improved feature view and model versioning" (#2728), which will ensure that changes to transformations are blocked if they impact feature services in prod.
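
As a rough illustration of step 2, the check could compare a fingerprint of each registered transformation against the proposed one and refuse changes that touch a view used by a production feature service. Everything here is hypothetical: `ServiceRecord`, the `stage` tag, and `blocked_changes` are names invented for this sketch, not Feast APIs or the design in #2728.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical record of a feature service and the views it reads."""
    name: str
    feature_views: list
    tags: dict = field(default_factory=dict)


def fingerprint(transformation_sql):
    """Hash a transformation, ignoring whitespace-only differences."""
    normalized = " ".join(transformation_sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def blocked_changes(registered, proposed, services):
    """Return {view_name: [prod service names]} for changed views used in prod."""
    blocked = {}
    for view_name, new_sql in proposed.items():
        old_sql = registered.get(view_name)
        if old_sql is None or fingerprint(old_sql) == fingerprint(new_sql):
            continue  # new view or unchanged logic: nothing to block
        impacted = [
            s.name
            for s in services
            if view_name in s.feature_views and s.tags.get("stage") == "prod"
        ]
        if impacted:
            blocked[view_name] = impacted
    return blocked


registered = {"user_txn": "SELECT user_id, transaction_count FROM txns"}
proposed = {"user_txn": "SELECT user_id, transaction_count + 100 FROM txns"}
services = [
    ServiceRecord("fraud_model_v1", ["user_txn"], {"stage": "prod"}),
    ServiceRecord("experiment", ["user_txn"], {"stage": "dev"}),
]
result = blocked_changes(registered, proposed, services)
```

Here the change to `user_txn` is flagged because `fraud_model_v1` is tagged as prod, while the dev-only service does not block it.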

Enabling batch transformations in Feast

See the RFC above for more details. In short, there should be support for:

SQL centric transformations (offline store agnostic)

  • Writing SQL transformations that execute in data warehouses to produce views.
# Proposed API from the RFC; not yet available in Feast.
@batch_feature_view(
    sources=[data_source],
    name="project.dataset.view",
    mode="snowflake_sql",
    timestamp_field="feature_timestamp",
)
def my_feature_view(data_source):
    # The returned SQL runs in the warehouse and produces the view.
    return f"""
      SELECT
            transaction_count + 100,
            user_id,
            feature_timestamp
      FROM {data_source}
    """
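
One way to picture how a SQL-mode decorator like the one above could behave: it calls the decorated function with the source table names, stores the rendered query, and lets the warehouse materialize it as a view. The sketch below is an assumption-heavy stand-in, not the RFC's implementation: `sql_feature_view` is a hypothetical name, sqlite3 plays the role of the warehouse, and the column alias is added only so the view has a usable column name.

```python
import sqlite3


def sql_feature_view(name, sources, timestamp_field):
    """Hypothetical stand-in for the proposed decorator (not a Feast API)."""
    def wrap(fn):
        # Render the query once by passing the source table names to the function.
        fn.view_name = name
        fn.timestamp_field = timestamp_field
        fn.query = fn(*sources)
        return fn
    return wrap


@sql_feature_view(
    name="user_transaction_view",
    sources=["transactions"],
    timestamp_field="feature_timestamp",
)
def my_feature_view(data_source):
    return f"""
        SELECT
            transaction_count + 100 AS transaction_count_plus_100,
            user_id,
            feature_timestamp
        FROM {data_source}
    """


# sqlite3 stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(user_id INTEGER, transaction_count INTEGER, feature_timestamp TEXT)"
)
conn.execute("INSERT INTO transactions VALUES (1, 5, '2022-05-01')")

# Materialize the rendered query as a view, then read it back.
conn.execute(f"CREATE VIEW {my_feature_view.view_name} AS {my_feature_view.query}")
result = conn.execute(f"SELECT * FROM {my_feature_view.view_name}").fetchall()
```

The table and view names are interpolated directly into the SQL, which is acceptable here only because they come from trusted feature definitions rather than user input.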

Pythonic transformations (e.g. PySpark)

  • The expectation, in Spark at least, would be that users bring their own Spark context.
# Proposed API from the RFC; not yet available in Feast.
@batch_feature_view(
    name="project.dataset.view",
    mode="pyspark",
    timestamp_field="feature_timestamp",
    sources=[credit_scores],
)
def user_has_good_credit(credit_scores):
    from pyspark.sql import functions as f

    # Flag users whose credit score is above 670, and keep the column named in timestamp_field.
    return credit_scores \
        .withColumn('user_has_good_credit', f.when(f.col('credit_score') > 670, 1).otherwise(0)) \
        .select('user_id', 'user_has_good_credit', 'feature_timestamp')

Doing side-lookups within Feast

  • For example, where you want to look up user features (including features from their country of origin), but you have only the user_id at request time:
# Proposed API from the RFC; not yet available in Feast.
@batch_feature_view(
    name="project.dataset.view",
    mode="bigquery_sql",
    entity="user",
    sources=[user_metadata, country_metadata],
    schema=[
        Field(name="user_id", dtype=Int64),
        Field(name="country_id", dtype=Int64),
        Field(name="country_category", dtype=Int64),
    ],
)
def my_feature_view(user_metadata, country_metadata):
    # Left join so users without a matching country row are still returned.
    return f'''
      SELECT
        user_id,
        country_id,
        event_timestamp,
        category AS country_category
      FROM {user_metadata} LEFT JOIN {country_metadata} ON
        {user_metadata}.country_id = {country_metadata}.id
    '''
  • In all of the cases above, there is an opportunity for wrappers that help data scientists author features (e.g. aggregations or common types of features).
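
As one possible shape for such a wrapper, a small helper could generate the aggregation SQL so that authors only declare which aggregates they want. `aggregation_sql` and its parameters are hypothetical names for this sketch; it also ignores time windows, which a real aggregation helper would likely need.

```python
def aggregation_sql(source, entity_key, timestamp_field, aggregations):
    """Build a GROUP BY query from {output_name: (function, column)} pairs."""
    allowed = {"SUM", "AVG", "MIN", "MAX", "COUNT"}
    select_parts = [entity_key]
    for output_name, (func, column) in aggregations.items():
        if func.upper() not in allowed:
            raise ValueError(f"unsupported aggregation: {func}")
        select_parts.append(f"{func.upper()}({column}) AS {output_name}")
    # Keep the latest timestamp per entity so the result still has a timestamp column.
    select_parts.append(f"MAX({timestamp_field}) AS {timestamp_field}")
    columns = ", ".join(select_parts)
    return f"SELECT {columns} FROM {source} GROUP BY {entity_key}"


query = aggregation_sql(
    source="transactions",
    entity_key="user_id",
    timestamp_field="feature_timestamp",
    aggregations={
        "total_amount": ("sum", "amount"),
        "txn_count": ("count", "amount"),
    },
)
```

The returned string could then be used as the body of a SQL-mode feature view function, so common feature types do not require hand-written SQL each time.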
