Getting historical features from a feature view with multiple entity keys took too long #3867

@romqn1999

Description

@romqn1999

I am encountering a performance issue when retrieving historical features from a feature view with multiple entity keys in Feast, using a PostgreSQL offline store.
I have two tables in PostgreSQL: a raw events table with a single-column primary key, and an aggregated data table with a composite key. The events table is significantly larger, containing roughly 10 times as many rows as the aggregated table.

Expected Behavior

Retrieval of historical features from the aggregated data feature view should be comparable in performance to retrieval from the raw events feature view.

Current Behavior

Retrieving historical features from the feature view based on the raw events table (larger dataset) completes within a minute. In contrast, the same operation on the aggregated data feature view with multiple entity keys does not complete (runs indefinitely).

Steps to reproduce

Here's a simplified code snippet to illustrate the setup and the issue:

from feast import Entity, FeatureView, Field
from feast.types import Int64, String
from feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source import (
    PostgreSQLSource,
)

event = Entity(name="event", join_keys=["event_id"])
user = Entity(name="user", join_keys=["user_id"])
item = Entity(name="item", join_keys=["item_id"])

events_source = PostgreSQLSource(
    name="events",
    query="SELECT * FROM events",
    timestamp_field="event_timestamp"
)

user_stats_source = PostgreSQLSource(
    name="user_stats",
    query="SELECT * FROM user_stats",
    timestamp_field="event_timestamp"
)

events_fv = FeatureView(
    name="event_fv",
    entities=[event],
    schema=[
        Field(name="user_id", dtype=String),
        Field(name="item_id", dtype=String),
        Field(name="val", dtype=Int64),
    ],
    source=events_source
)

user_stats_fv = FeatureView(
    name="user_activity",
    entities=[user, item],
    schema=[
        Field(name="val", dtype=Int64)
    ],
    source=user_stats_source
)

Retrieve offline features for training

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get the latest feature values for unique entities
entity_sql = f"""
    SELECT
        user_id,
        item_id,
        CURRENT_TIMESTAMP AS event_timestamp
    FROM {store.get_data_source("user_stats").get_table_query_string()}
    WHERE event_timestamp BETWEEN '2023-01-01' and '2023-12-31'
"""
data = store.get_historical_features(
    entity_df=entity_sql,
    features=store.get_feature_service("model_v2"),
).to_df()
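For reference, the entity rows the SQL above is meant to produce have this shape. A minimal stdlib sketch (the column names come from the example above; the data is invented). Note that the comment in the SQL asks for unique entities, but the query as written has no `SELECT DISTINCT`, so deduplication is done explicitly here:

```python
from datetime import datetime, timezone

# Invented rows standing in for the user_stats table; note the duplicate key.
raw = [("u1", "i42"), ("u2", "i7"), ("u1", "i42")]

# Deduplicate the (user_id, item_id) composite keys and stamp each with a
# retrieval timestamp, matching the columns entity_sql selects.
now = datetime.now(timezone.utc)
entity_rows = [
    {"user_id": u, "item_id": i, "event_timestamp": now}
    for (u, i) in sorted(set(raw))
]
```

Each of these rows becomes one point-in-time lookup against the feature view, joined on both `user_id` and `item_id`.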

Specifications

  • Version: 0.34.1
  • Platform: Linux
  • Subsystem: CentOS Linux release 7.9.2009 (Core)

Possible Solution

A workaround that improved performance was to combine multiple keys into a single key by concatenating strings. This approach made historical feature retrieval from the aggregated table faster than from the raw events table. However, this feels more like a workaround than a solution, and I am seeking advice or a fix for handling multiple entity keys efficiently without resorting to key concatenation.
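The concatenation workaround can be sketched like this. It is a minimal illustration, not Feast API code: the `user_item_id` column name and the `__` separator are my own choices, and the separator must be one that cannot appear inside real IDs, otherwise the combined key is ambiguous:

```python
def composite_key(*parts, sep="__"):
    """Concatenate entity key parts into a single join key.

    Assumes no part contains the separator; choose a separator that
    never appears in real IDs so the mapping stays unambiguous.
    """
    return sep.join(str(p) for p in parts)

# Illustrative rows standing in for the aggregated user_stats table.
rows = [
    {"user_id": "u1", "item_id": "i42", "val": 3},
    {"user_id": "u2", "item_id": "i7", "val": 5},
]
for row in rows:
    row["user_item_id"] = composite_key(row["user_id"], row["item_id"])
```

With a column like this materialized on both the source table and the entity dataframe, the feature view can declare a single entity with `join_keys=["user_item_id"]` instead of two, which is what made retrieval fast in my testing.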
