get_historical_features does not work on join keys with field mappings #4889

@aloysius-lim

Description

Expected Behavior

Given an Entity where the join key column is called something else in the data source, a field_mapping can be set on the data source to map the source column name to the join key. get_historical_features should then recognize that the join key has a field mapping, and generate the correct alias in the query.

For example:

from feast import Entity, Field, FeatureStore, FeatureView
from feast.types import Float32, String
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Initialize Feature Store.
store = FeatureStore(...)

# "driver_id" is used in the Feature Store.
driver = Entity(name="driver", join_keys=["driver_id"])

# Using SparkSource as an example, but this applies to other sources.
# Source data contains a primary key called "id". This is mapped to the join key "driver_id".
driver_stats_src = SparkSource(
    name="driver_stats",
    field_mapping={"id": "driver_id"},
    path=...,
    file_format=...,
)
driver_stats_fv = FeatureView(
    name="driver_stats",
    source=driver_stats_src,
    entities=[driver],
    schema=[
        # join key must be specified in the schema, else it is not included in driver_stats_fv.entity_columns
        Field(name="driver_id", dtype=String),
        Field(name="stat1", dtype=Float32),
        Field(name="stat2", dtype=Float32),
    ]
)

# Get historical features
store.get_historical_features(
    entity_df=...,
    features=[
        "driver_stats:stat1",
        "driver_stats:stat2",
    ],
)

When get_historical_features is run, the generated query should alias the source column as id AS driver_id. In the case of Spark, for example, this should be the query:

driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,

        id AS driver_id,
        
        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Current Behavior

This is what currently happens (Spark example):

pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `driver_id` cannot be resolved. Did you mean one of the following? [`id`, `stat1`, `stat2`]

Underlying Spark query:

driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,

        -- Here is the problem.
        driver_id AS driver_id,

        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Steps to reproduce

See example above.

Specifications

  • Version: 0.42.0
  • Platform: macOS 14.6.1
  • Subsystem:

Possible Solution

See PR #4886
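As a rough illustration of the fix (this is a hedged sketch, not the actual Feast internals or the code in the PR): when building the SELECT list for an entity column, the offline store could consult the data source's field_mapping in reverse to recover the original source column name. The function name alias_for_join_key below is hypothetical.

```python
# Illustrative sketch only: resolve a join key back to its source column
# via the data source's field_mapping (which maps source name -> Feast name),
# then emit the SQL alias the subquery needs.

def alias_for_join_key(join_key: str, field_mapping: dict) -> str:
    """Return a 'source_col AS join_key' SQL alias, honoring field_mapping."""
    # field_mapping maps source column -> feature store name, so reverse it
    # to look up the source column for a given join key.
    reverse_mapping = {dest: src for src, dest in field_mapping.items()}
    source_col = reverse_mapping.get(join_key, join_key)
    return f"{source_col} AS {join_key}"

print(alias_for_join_key("driver_id", {"id": "driver_id"}))  # id AS driver_id
print(alias_for_join_key("driver_id", {}))  # driver_id AS driver_id
```

With the mapping from the example above, this yields the expected alias id AS driver_id instead of the unresolved driver_id AS driver_id.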
