Batch transformations

See also: [Feast RFC-028: Batch Transformations](https://docs.google.com/document/d/1964OkzuBljifDvkV-0fakp2uaijnVzdwWNGdz7Vz50A/edit)
See also: [Feast Slack: #feast-feature-transformations](https://tectonfeast.slack.com/archives/C03A2CH7GAG)

# Overview
Today, users can use Feast to track the final view of batch features. At retrieval time, users need to specify the entity `join_key`s for each feature view. For example:

![batch_transform](https://user-images.githubusercontent.com/1476382/170064418-75eb84ba-6472-4ae1-a4eb-d2471d70b962.png)

## Problem 1: Difficult to iterate on transformation logic and avoid duplication, especially when there are downstream models
The above flow requires that users maintain transformation logic outside of Feast. To iterate on transformation logic and adapt it to different use cases, users need to find the relevant pipeline logic outside of Feast.

If a data scientist wants to help author the necessary features for a specific model, today the data scientist needs to:
1. Set up transformation logic that outputs the view above
2. Register the view as a data source in Feast (+ feature view)
3. Register a feature service referencing the feature view

That first step can be problematic for several reasons:
- Data scientists are responsible for creating views of data without visibility into what other views already exist and how they are produced.
- If the transformation logic changes, there's no easy way to know whether it impacts downstream models that depend on the output features.

**Note:** Feast does not intend to be a full solution for describing complex sequences of transformation DAGs in the near future. The goal is to support the simple transformations that cover the common (80%) case.

## Problem 2: Difficult to retrieve features that require a side lookup
This is best explained by example: if you want to look up user features (including features from their country of origin) but have only the `user_id` at request time, then at training data generation (or materialization) time you'd ideally do a left join onto the user features to append the country features.
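The join described above can be sketched with Python's built-in `sqlite3` module. The table names, columns, and sample rows here are illustrative assumptions and not part of Feast; the point is only that a request carrying `user_id` alone can still pick up country features through a left join.

```python
import sqlite3

# Hypothetical tables and rows, used only to illustrate the side lookup.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_metadata (user_id INTEGER, country_id INTEGER);
    CREATE TABLE country_metadata (id INTEGER, category INTEGER);
    INSERT INTO user_metadata VALUES (1, 10), (2, 20), (3, 99);
    INSERT INTO country_metadata VALUES (10, 7), (20, 8);
    """
)


def lookup_user_features(user_ids):
    """Return (user_id, country_id, country_category) rows for the given users."""
    placeholders = ",".join("?" for _ in user_ids)
    query = f"""
        SELECT u.user_id, u.country_id, c.category AS country_category
        FROM user_metadata AS u
        LEFT JOIN country_metadata AS c ON u.country_id = c.id
        WHERE u.user_id IN ({placeholders})
        ORDER BY u.user_id
    """
    return conn.execute(query, list(user_ids)).fetchall()


# User 3 points at a country with no metadata row, so the left join
# keeps the user and leaves country_category empty.
print(lookup_user_features([1, 3]))  # [(1, 10, 7), (3, 99, None)]
```

A left join (rather than an inner join) matters here because a user whose country has no metadata row should still appear in the training set, just with a missing country feature.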

# Potential Solution
There are two steps:
1. Enable batch transformations to be defined in Feast
2. Wrap up https://github.com/feast-dev/feast/issues/2728, which will ensure that changes to transformations are blocked if they impact feature services in prod.
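One possible shape for the check in step 2 is sketched below. It is not taken from the linked issue: the registry contents, the service and view names, and the functions `impacted_prod_services` and `check_change` are all hypothetical, and a real implementation would read this mapping from the Feast registry.

```python
from typing import Dict, List, Set

# Hypothetical registry snapshot: which feature views each feature service
# references, and which services are deployed to production.
SERVICE_TO_VIEWS: Dict[str, Set[str]] = {
    "fraud_model_v1": {"user_transaction_counts", "user_credit"},
    "churn_model_v2": {"user_credit"},
    "experimental_ranker": {"user_transaction_counts"},
}
PROD_SERVICES: Set[str] = {"fraud_model_v1", "churn_model_v2"}


def impacted_prod_services(changed_views: Set[str]) -> List[str]:
    """Return the production feature services that read any changed view."""
    return sorted(
        name
        for name, views in SERVICE_TO_VIEWS.items()
        if name in PROD_SERVICES and views & changed_views
    )


def check_change(changed_views: Set[str]) -> None:
    """Raise if the change would touch a production feature service."""
    impacted = impacted_prod_services(changed_views)
    if impacted:
        raise ValueError(
            f"Change to {sorted(changed_views)} impacts prod services: {impacted}"
        )


print(impacted_prod_services({"user_transaction_counts"}))  # ['fraud_model_v1']
```

In this sketch a change to `user_transaction_counts` is flagged because `fraud_model_v1` is in production, while `experimental_ranker` is ignored because it is not.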

## Enabling batch transformations in Feast
See the RFC above for more details. In short, Feast should support the following:
### SQL centric transformations (offline store agnostic)
   - Writing SQL transformations that execute in data warehouses to produce views.
   ```python
   @batch_feature_view(
       sources=[data_source],
       name="project.dataset.view",
       mode="snowflake_sql",
       timestamp_field="feature_timestamp",
   )
   def my_feature_view(data_source):
       # Return the SQL that the warehouse runs to produce the view.
       return f"""
           SELECT
               transaction_count + 100,
               user_id,
               feature_timestamp
           FROM {data_source}
       """
   ```
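To make the mechanics concrete, here is a minimal sketch of how a decorator of this kind could capture the metadata and render the SQL template. It is not Feast's implementation: `sql_feature_view`, `SqlFeatureViewSketch`, and the table name `raw.transactions` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SqlFeatureViewSketch:
    """Hypothetical container for a SQL-defined view; not a Feast class."""

    name: str
    mode: str
    timestamp_field: str
    sources: List[str]
    template: Callable[..., str]

    def render(self) -> str:
        # Pass each source's table name to the user function to get final SQL.
        return self.template(*self.sources)


def sql_feature_view(name, mode, timestamp_field, sources):
    """Hypothetical decorator that wraps a SQL-returning function."""

    def wrap(fn):
        return SqlFeatureViewSketch(name, mode, timestamp_field, list(sources), fn)

    return wrap


@sql_feature_view(
    name="project.dataset.view",
    mode="snowflake_sql",
    timestamp_field="feature_timestamp",
    sources=["raw.transactions"],  # assumed table name
)
def my_feature_view(data_source):
    return f"SELECT user_id, feature_timestamp FROM {data_source}"


print(my_feature_view.render())
# SELECT user_id, feature_timestamp FROM raw.transactions
```

Keeping the function as a template that is rendered later means the registry can store the transformation alongside its sources, which is what makes it possible to trace which views a change affects.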
### Pythonic transformations (e.g. PySpark)
   - The expectation, at least in Spark, is that users bring their own Spark context.
   ```python
   @batch_feature_view(
       name="project.dataset.view",
       mode="pyspark",
       timestamp_field="feature_timestamp",
       sources=[credit_scores],
   )
   def user_has_good_credit(credit_scores):
       from pyspark.sql.functions import col, when

       # Flag users whose credit score is above 670, and keep the
       # column named in timestamp_field.
       return credit_scores \
           .withColumn('user_has_good_credit', when(col('credit_score') > 670, 1).otherwise(0)) \
           .select('user_id', 'user_has_good_credit', 'feature_timestamp')
   ```
### Doing side-lookups within Feast
  - For example, where you want to look up user features (including features from their country of origin) but have only the `user_id` at request time:
  ```python
  @batch_feature_view(
      name="project.dataset.view",
      mode="bigquery_sql",
      entity="user",
      sources=[user_metadata, country_metadata],
      schema=[
          Field(name="user_id", dtype=Int64),
          Field(name="country_id", dtype=Int64),
          Field(name="country_category", dtype=Int64),
      ],
  )
  def my_feature_view(user_metadata, country_metadata):
      # Left join so that users without a matching country row are kept.
      return f'''
          SELECT
              user_id,
              country_id,
              event_timestamp,
              category AS country_category
          FROM {user_metadata} LEFT JOIN {country_metadata} ON
              {user_metadata}.country_id = {country_metadata}.id
      '''
  ```
- In all of the cases above, there is an opportunity for wrappers that help data scientists author features (e.g. aggregations or other common types of features).
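As a rough illustration of such a wrapper, the sketch below generates the SQL for a simple per-entity aggregation. The helper `aggregation_sql` and its parameters are hypothetical and not part of Feast, and it deliberately ignores time windows and timestamps, which a real wrapper would need to handle.

```python
from typing import Sequence

_ALLOWED_FUNCS = {"SUM", "AVG", "MIN", "MAX", "COUNT"}


def aggregation_sql(
    source: str, entity_key: str, column: str, funcs: Sequence[str]
) -> str:
    """Build a GROUP BY query with one output column per aggregation function."""
    selected = []
    for func in funcs:
        upper = func.upper()
        if upper not in _ALLOWED_FUNCS:
            raise ValueError(f"Unsupported aggregation: {func}")
        # Name each output column after the input column and the function.
        selected.append(f"{upper}({column}) AS {column}_{upper.lower()}")
    columns = ", ".join([entity_key] + selected)
    return f"SELECT {columns} FROM {source} GROUP BY {entity_key}"


print(aggregation_sql("transactions", "user_id", "amount", ["sum", "count"]))
# SELECT user_id, SUM(amount) AS amount_sum, COUNT(amount) AS amount_count FROM transactions GROUP BY user_id
```

A data scientist could return the generated string from a SQL-mode feature view function instead of writing the query by hand.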
