Expected Behavior
Given an Entity whose join key column has a different name in the data source, a field_mapping can be set on the data source to map the source column name to the join key. get_historical_features should then recognize that the join key has a field mapping and generate the correct alias in the query.
For example:
from feast import Entity, Field, FeatureStore, FeatureView
from feast.types import Float32, String
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
# Initialize Feature Store.
store = FeatureStore(...)
# "driver_id" is used in the Feature Store.
driver = Entity(name="driver", join_keys=["driver_id"])
# Using SparkSource as an example, but this applies to other sources.
# Source data contains a primary key called "id". This is mapped to the join key "driver_id".
driver_stats_src = SparkSource(
    name="driver_stats",
    field_mapping={"id": "driver_id"},
    path=...,
    file_format=...,
)

driver_stats_fv = FeatureView(
    name="driver_stats",
    source=driver_stats_src,
    entities=[driver],
    schema=[
        # The join key must be specified in the schema, otherwise it is not
        # included in driver_stats_fv.entity_columns.
        Field(name="driver_id", dtype=String),
        Field(name="stat1", dtype=Float32),
        Field(name="stat2", dtype=Float32),
    ],
)
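# Register the entity and feature view with the store before retrieval
# (standard Feast workflow step, added here so the example is complete).
store.apply([driver, driver_stats_fv])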
# Get historical features.
store.get_historical_features(
    entity_df=...,
    features=[
        "driver_stats:stat1",
        "driver_stats:stat2",
    ],
)

When get_historical_features is run, the alias id AS driver_id should be generated in the query. In the case of Spark, for example, the resulting query should be:
driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,
        id AS driver_id,
        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Current Behavior
This is what currently happens (Spark example):
pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `driver_id` cannot be resolved. Did you mean one of the following? [`id`, `stat1`, `stat2`]
Underlying Spark query:
driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,
        -- Here is the problem: the field_mapping is not applied, so the source
        -- column `id` is never aliased and `driver_id` cannot be resolved.
        driver_id AS driver_id,
        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Steps to reproduce
See example above.
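The entity_df argument is elided above; any DataFrame containing the join key and an event_timestamp column works, for example (values are illustrative):

import pandas as pd
from datetime import datetime, timezone

# Illustrative entity dataframe: the join key ("driver_id") plus the required
# event_timestamp column.
entity_df = pd.DataFrame(
    {
        "driver_id": ["d1", "d2"],
        "event_timestamp": [
            datetime(2025, 1, 5, 12, 0, tzinfo=timezone.utc),
            datetime(2025, 1, 5, 13, 0, tzinfo=timezone.utc),
        ],
    }
)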
Specifications
- Version: 0.42.0
- Platform: macOS 14.6.1
- Subsystem:
Possible Solution
See PR #4886
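For context, a minimal sketch of the general idea (not necessarily how PR #4886 implements it): reverse the data source's field_mapping when building the SELECT aliases for entity columns, so a mapped join key resolves back to its source column name. Names below mirror the example above.

# Hypothetical sketch, not Feast internals: derive SELECT aliases by reversing
# the field_mapping so the join key "driver_id" is read from source column "id".
field_mapping = {"id": "driver_id"}  # source column -> feature store name
reverse_mapping = {value: key for key, value in field_mapping.items()}

entity_columns = ["driver_id"]
feature_columns = ["stat1", "stat2"]

select_aliases = [
    f"{reverse_mapping.get(col, col)} AS {col}"
    for col in entity_columns + feature_columns
]
print(select_aliases)  # ['id AS driver_id', 'stat1 AS stat1', 'stat2 AS stat2']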