Description
I am encountering a performance issue when retrieving historical features from a feature view with multiple entity keys in Feast, using a PostgreSQL offline store.
I have two tables in PostgreSQL: a raw events table with a single-column primary key and an aggregated data table with a composite key. The events table is significantly larger, containing approximately 10 times as many rows as the aggregated table.
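For concreteness, the two tables look roughly like the following. This is a sketch only; the exact column names, types, and key definitions are assumptions for illustration, not the real schema:

```sql
-- Raw events table: single-column primary key, ~10x more rows
CREATE TABLE events (
    event_id        TEXT PRIMARY KEY,
    user_id         TEXT NOT NULL,
    item_id         TEXT NOT NULL,
    val             BIGINT,
    event_timestamp TIMESTAMP NOT NULL
);

-- Aggregated table: composite key on (user_id, item_id)
CREATE TABLE user_stats (
    user_id         TEXT NOT NULL,
    item_id         TEXT NOT NULL,
    val             BIGINT,
    event_timestamp TIMESTAMP NOT NULL,
    PRIMARY KEY (user_id, item_id)
);
```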
Expected Behavior
Retrieval of historical features from the aggregated-data feature view should be comparable in performance to retrieval from the raw-events feature view.
Current Behavior
Retrieving historical features from the feature view based on the raw events table (larger dataset) completes within a minute. In contrast, the same operation on the aggregated data feature view with multiple entity keys does not complete (runs indefinitely).
Steps to reproduce
Here's a simplified code snippet to illustrate the setup and the issue:
from feast import Entity, FeatureView, Field
from feast.types import Int64, String
from feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source import (
    PostgreSQLSource,
)

event = Entity(name="event", join_keys=["event_id"])
user = Entity(name="user", join_keys=["user_id"])
item = Entity(name="item", join_keys=["item_id"])

events_source = PostgreSQLSource(
    name="events",
    query="SELECT * FROM events",
    timestamp_field="event_timestamp",
)

user_stats_source = PostgreSQLSource(
    name="user_stats",
    query="SELECT * FROM user_stats",
    timestamp_field="event_timestamp",
)

# Feature view over the raw events table (single entity key)
events_fv = FeatureView(
    name="event_fv",
    entities=[event],
    schema=[
        Field(name="user_id", dtype=String),
        Field(name="item_id", dtype=String),
        Field(name="val", dtype=Int64),
    ],
    source=events_source,
)

# Feature view over the aggregated table (composite entity key)
user_stats_fv = FeatureView(
    name="user_activity",
    entities=[user, item],
    schema=[
        Field(name="val", dtype=Int64),
    ],
    source=user_stats_source,
)

Retrieve offline features for training
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get the latest feature values for unique entities
entity_sql = f"""
SELECT
    user_id,
    item_id,
    CURRENT_TIMESTAMP AS event_timestamp
FROM {store.get_data_source("user_stats").get_table_query_string()}
WHERE event_timestamp BETWEEN '2023-01-01' AND '2023-12-31'
"""

data = store.get_historical_features(
    entity_df=entity_sql,
    features=store.get_feature_service("model_v2"),
).to_df()

Specifications
- Version: 0.34.1
- Platform: Linux
- Subsystem: CentOS Linux release 7.9.2009 (Core)
Possible Solution
A workaround that improved performance was to combine the multiple keys into a single key by concatenating the strings. With this change, historical feature retrieval from the aggregated table became faster than from the raw events table. However, this is a workaround rather than a proper solution, and I am seeking advice or a fix for handling multiple entity keys efficiently without resorting to key concatenation.
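For reference, the concatenation workaround can be sketched as below. This is a minimal illustration of the key-combining step only, not the Feast API; the separator, the `user_item_id` column name, and the helper functions are all assumptions made for this example:

```python
# Sketch of the workaround: collapse the composite key (user_id, item_id)
# into a single surrogate string key before it reaches Feast, both in the
# source query and in the entity dataframe. Hypothetical SQL for the source:
#
#   SELECT user_id || ':' || item_id AS user_item_id, val, event_timestamp
#   FROM user_stats

SEPARATOR = ":"  # must not occur inside either key


def combine_keys(user_id: str, item_id: str) -> str:
    """Concatenate the two join keys into one surrogate key."""
    return f"{user_id}{SEPARATOR}{item_id}"


def split_key(user_item_id: str) -> tuple[str, str]:
    """Recover the original keys from the surrogate key."""
    user_id, item_id = user_item_id.split(SEPARATOR, 1)
    return user_id, item_id


rows = [("u1", "i9"), ("u2", "i3")]
combined = [combine_keys(u, i) for u, i in rows]
print(combined)  # ['u1:i9', 'u2:i3']
assert all(split_key(k) == row for k, row in zip(combined, rows))
```

On the Feast side this corresponds to a single entity such as `Entity(name="user_item", join_keys=["user_item_id"])` and a feature view declaring only that entity, so the point lookup happens on one column instead of two.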