Skip to content

Conversation

@franciscojavierarceo
Copy link
Member

@franciscojavierarceo franciscojavierarceo commented Nov 28, 2025

What this PR does / why we need it:

This PR refactors the transformation system with a cleaner architecture that separates transformation logic from execution. Transformations should focus on HOW to transform data, while FeatureViews handle WHEN and WHERE to execute.

Changes:

  • Added: feature_transformation field to FeatureView for transformation logic
  • Added: transform: bool = True parameter to API methods for per-call control
  • Added: Schema validation for pre-transformed data (transform=False)
  • Added: Transformation execution logic in online/offline feature retrieval
  # Define transformation logic
  @transformation(mode="python")
  def driver_features(inputs):
      return {"conv_plus_acc": [c + a for c, a in zip(inputs["conv_rate"], inputs["acc_rate"])]}

  # Create FeatureView with transformation
  fv = FeatureView(
      name="driver_computed_features",
      source=driver_source,
      entities=[driver],
      schema=[Field(name="conv_plus_acc", dtype=Float64)],
      feature_transformation=driver_features,  # ← NEW
      online=True,
  )

  # API-level transformation control
  # Execute transformations (default)
  features = store.get_online_features(
      features=["driver_computed_features:conv_plus_acc"],
      entity_rows=entity_rows,
      transform=True  # ← NEW (default)
  )

  # Skip transformations for external batch jobs
  features = store.get_online_features(
      features=["driver_computed_features:conv_plus_acc"],
      entity_rows=spark_transformed_rows,
      transform=False  # ← NEW: skip + validate schema
  )

Which issue(s) this PR fixes:

#4584, #5716, #5689

Misc

@franciscojavierarceo franciscojavierarceo requested a review from a team as a code owner November 28, 2025 18:20
@franciscojavierarceo franciscojavierarceo force-pushed the refactor-odfv branch 2 times, most recently from 2899413 to 649c429 Compare December 1, 2025 21:47
from feast.feature_view import FeatureView as FV

for src in sources:
if not isinstance(src, (FV, type(src).__name__ == "FeatureView")):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not isinstance(src, (FV, type(src).__name__ == "FeatureView")):
if not isinstance(src, FeatureView):

owner: str = "",
mode: Optional[Union["TransformationMode", str]] = None,
feature_transformation: Optional[Transformation] = None,
when: Optional[Union[TransformationTiming, str]] = None,
Copy link
Member

@ntkathole ntkathole Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need updates to proto ? when and online_enabled not serialized in from_proto.

Also, modify __eq__() and __copy__() for consider these new attributes ?

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Dec 16, 2025

I think when can be a bit confusing. even for 'batch', it can be 'on_read' or 'on_write'. I think on a different angle, should transformation be in the scope of writing, and isn't transformation only responsible for transforming the data?

FeatureView as the actually data assets, I think makes more sense to having the API.

franciscojavierarceo and others added 4 commits December 16, 2025 16:57
… auto-inference

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
@franciscojavierarceo
Copy link
Member Author

I think when can be a bit confusing. even for 'batch', it can be 'on_read' or 'on_write'. I think on a different angle, should transformation be in the scope of writing, and isn't transformation only responsible for transforming the data?

FeatureView as the actually data assets, I think makes more sense to having the API.

Yeah, that's fair. @HaoXuAI I updated the PR, lmk what you think.

self.transform_when = transform_when

# Auto-infer online setting based on transform_when pattern
if transform_when in [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why batch on read/write forces online to be true?
I think we need to think about if we really need a new tranform_when config. And if we need it, what is its relation with online/offline.
I think it might be counterintuitive that if this featureview is only used by batch or stream or on demand. what if the user defines it today as one mode and wants to change it to another one or allow both mode in the future?
My personal pint of view is removing any config for this. As long as you define the transformation, it will be applied to run before writing to destination. Use online/offline to control the destination only, if both are off, then no write. For any transformation needed on read, this should be configured at API level. Fo example, you can put get_online_feature(transform=true) to force transformation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to think about if we really need a new tranform_when config. And if we need it, what is its relation with online/offline.

Here's how things actually work:

  1. Batch
  • Sometimes data can be transformed online
    1. On Reads
    2. On Writes
  • Sometimes not, and that's basically a batch transform with no feature server transformation and pure materialization
  1. Streaming
  • Handles historical data the same way it handles online serving

that's really what the transform_when indicates.

We need something other than online because the rest will be executed by the server and you can't orchestrate the transformation on write during retrieval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with renaming something but we need something equivalent to write_to_online_store that's more obvious.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

online = true means write to online store.

transform_on_write should be removed from the server. it's a hacky way. #5283. this PR doesn't get enough reviews I think.
#5300. I think this one is also not ideal implementation as well.

When you use get_online_feature, you should assign it a transform=true to run transformation in "retrieval".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read through the feature server code and on demand feature view implementation. I think a lot code needs to be refactored. And if you think there is some lines of code won't work you can tag me. I'll try to explain it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we agreed on this direction and I'm going to implement things that way. Just need to finish the work.

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants