Description
This bug is related to the previously reported issue #2803.
If a batch being materialized contains at least some values for a timestamp column, the column is correctly interpreted as datetime and materialization works. If the column contains only nulls, however, its values are interpreted as np.NaN floats, and materialization fails.
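The difference between the two cases can already be seen in pandas' dtype inference when the DataFrame is built. The following is a minimal sketch, not part of the original report: the column names `partial` and `all_null` are made up for illustration, and it only shows the pandas side, not the later Parquet and Arrow steps inside Feast. It uses `np.nan`, since the `np.NaN` alias was removed in NumPy 2.0.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical two-column frame: one timestamp column with partial data,
# one with nulls only (mirrors last_purchase_date in the script below).
df = pd.DataFrame(
    {
        "partial": [datetime(2022, 5, 1), np.nan],
        "all_null": [np.nan, np.nan],
    }
)

# With at least one datetime present, pandas infers a datetime dtype
# and represents the missing value as NaT.
partial_is_datetime = pd.api.types.is_datetime64_any_dtype(df["partial"])

# With only NaN present, there is nothing to infer a datetime from,
# so the column stays a plain float column.
all_null_is_float = pd.api.types.is_float_dtype(df["all_null"])

print(df.dtypes)
```

So an all-null column carries no hint that it was meant to hold timestamps, which is consistent with the failure described above.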
Steps to reproduce
Here's a slightly modified version of the script from issue #2803 that reproduces this behaviour:
from datetime import datetime
import numpy as np
import pandas as pd
from feast import Entity, FeatureStore, FeatureView, Field
from feast.infra.offline_stores.file_source import FileSource
from feast.repo_config import RegistryConfig, RepoConfig
from feast.types import Int32, UnixTimestamp
# create dataset
pd.DataFrame(
    [
        {
            "user_id": 1,
            "event_timestamp": datetime(2022, 5, 1),
            "created": datetime(2022, 5, 1),
            "purchases": 3,
            "last_purchase_date": np.NaN,
        },
        {
            "user_id": 2,
            "event_timestamp": datetime(2022, 5, 2),
            "created": datetime(2022, 5, 2),
            "purchases": 1,
            "last_purchase_date": np.NaN,
        },
        {
            "user_id": 3,
            "event_timestamp": datetime(2022, 5, 2),
            "created": datetime(2022, 5, 2),
            "purchases": 0,
            "last_purchase_date": np.NaN,
        },
    ]
).to_parquet("user_stats.parquet")
user = Entity(name="user_id", description="user id")
user_stats_view = FeatureView(
    name="user_stats",
    entities=[user],
    source=FileSource(
        path="user_stats.parquet",
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    ),
    schema=[
        Field(name="purchases", dtype=Int32),
        Field(name="last_purchase_date", dtype=UnixTimestamp),
    ],
)
online_store_path = "online_store.db"
registry_path = "registry.db"
repo = RepoConfig(
    registry="registry.db",
    project="feature_store",
    provider="local",
    offline_store="file",
    use_ssl=True,
    is_secure=True,
    validate=True,
)
fs = FeatureStore(config=repo)
fs.apply([user, user_stats_view])
fs.materialize_incremental(end_date=datetime.utcnow())
entity_rows = [{"user_id": i} for i in range(1, 4)]
feature_df = fs.get_online_features(
    features=[
        "user_stats:purchases",
        "user_stats:last_purchase_date",
    ],
    entity_rows=entity_rows,
).to_df()
print(feature_df)
Note that all the values of the last_purchase_date column have been set to np.NaN to trigger this bug. The reproduction script in #2803 had partial data.
Current Behavior
/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/repo_config.py:207: RuntimeWarning: `entity_key_serialization_version` is either not specified in the feature_store.yaml, or is specified to a value <= 1.This serialization version may cause errors when trying to write fields with the `Long` data type into the online store. Specifying `entity_key_serialization_version` to 2 is recommended for new projects.
warnings.warn(
Materializing 1 feature views to 2022-09-07 18:14:33-04:00 into the sqlite online store.
Since the ttl is 0 for feature view user_stats, the start date will be set to 1 year before the current time.
user_stats from 2021-09-08 18:14:33-04:00 to 2022-09-07 18:14:33-04:00:
0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/abhin/src/github.com/Shopify/pano/repro/repro.py", line 71, in <module>
    fs.materialize_incremental(end_date=datetime.utcnow())
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/feature_store.py", line 1323, in materialize_incremental
    provider.materialize_single_feature_view(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/infra/passthrough_provider.py", line 252, in materialize_single_feature_view
    raise e
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/infra/materialization/local_engine.py", line 170, in _materialize_one
    rows_to_write = _convert_arrow_to_proto(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/utils.py", line 206, in _convert_arrow_to_proto
    proto_values_by_column = {
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/utils.py", line 207, in <dictcomp>
    column: python_values_to_proto_values(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 446, in python_values_to_proto_values
    return _python_value_to_proto_value(value_type, values)
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 392, in _python_value_to_proto_value
    int_timestamps = _python_datetime_to_int_timestamp(values)
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 324, in _python_datetime_to_int_timestamp
    int_timestamps.append(int(value))
ValueError: cannot convert float NaN to integer
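The last frame of the traceback can be reproduced without Feast at all. This is a small standalone sketch, using only the standard library; `float("nan")` stands in for `np.NaN`, which is the same Python float value.

```python
import math

# np.NaN is an ordinary Python float, so float("nan") behaves identically here.
nan_value = float("nan")

try:
    # This is the same call that fails in the final traceback frame.
    int(nan_value)
except ValueError as exc:
    error_message = str(exc)

print(type(nan_value).__name__)  # float
print(error_message)             # cannot convert float NaN to integer
```

Any code path that passes an unfiltered NaN to `int()` will therefore fail in the same way.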
Expected behaviour
Materialization should complete without raising an error, even when a timestamp column contains only nulls.
Specifications
- Version: 0.24.0
- Platform:
- Subsystem:
Possible Solution
In type_map.py, _python_datetime_to_int_timestamp should handle np.NaN values on a separate path. Since type(np.NaN) is float, the current code path ends up calling int(np.NaN), which raises the ValueError shown above. We could detect NaN values explicitly and map them directly to NULL_TIMESTAMP_INT_VALUE.
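A rough sketch of that idea follows. It is not Feast's actual implementation: the function name `datetimes_to_int_timestamps` is hypothetical, the value given to `NULL_TIMESTAMP_INT_VALUE` is only a stand-in for the real constant in type_map.py, and the handling of `datetime` inputs is an assumption. Only the final `int(value)` fallback is taken from the traceback.

```python
import math
from datetime import datetime, timezone
from typing import Any, List, Sequence

# Stand-in sentinel; the real constant is defined in feast/type_map.py
# and its value is assumed here for illustration only.
NULL_TIMESTAMP_INT_VALUE = -(2**63)


def datetimes_to_int_timestamps(values: Sequence[Any]) -> List[int]:
    """Hypothetical NaN-aware variant of the conversion described above."""
    int_timestamps: List[int] = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            # Separate path for nulls: never reaches int(), so no ValueError.
            int_timestamps.append(NULL_TIMESTAMP_INT_VALUE)
        elif isinstance(value, datetime):
            # Assumed handling: whole seconds since the Unix epoch.
            # Naive datetimes are interpreted in local time by timestamp().
            int_timestamps.append(int(value.timestamp()))
        else:
            # Fallback matching the int(value) call seen in the traceback.
            int_timestamps.append(int(value))
    return int_timestamps


all_null = datetimes_to_int_timestamps([float("nan")] * 3)
mixed = datetimes_to_int_timestamps(
    [datetime(2022, 5, 1, tzinfo=timezone.utc), float("nan")]
)
print(all_null)
print(mixed)
```

Whether None should share the NaN path, and whether the sentinel is the right stored value for a missing timestamp, are design choices for the maintainers. The sketch only shows that the check can sit in front of the failing `int()` call.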