Materialization fails if a timestamp column is full of np.NaNs #3189

@chhabrakadabra

Description

This bug is related to the previously reported issue #2803.

If the batch of data being materialized has at least some non-null values in a timestamp column, the column is interpreted correctly as datetime and materialization works. But if the column contains only nulls, its values are interpreted as np.NaN floats, and materialization fails.
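The dtype difference can be seen in pandas alone, without Feast. The following is a minimal sketch, assuming the nulls are supplied as `np.nan` (the lowercase spelling, which also works on NumPy versions that no longer provide the `np.NaN` alias). The column name is simply the one used in this report.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Partially populated column: pandas infers a datetime dtype and the
# missing entry is stored as NaT.
partial = pd.Series([datetime(2022, 5, 1), np.nan], name="last_purchase_date")

# All-null column: nothing hints at datetimes, so pandas falls back to a
# float column whose elements are NaN floats.
all_null = pd.Series([np.nan, np.nan, np.nan], name="last_purchase_date")

print(partial.dtype)
print(all_null.dtype)
print(type(all_null.iloc[0]))
```

In the all-null case the values reaching the timestamp conversion are floats, which is why the conversion later attempts `int()` on a NaN.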

Steps to reproduce

Here's a slightly modified script from the issue #2803 that can replicate this behaviour:

from datetime import datetime

import numpy as np
import pandas as pd
from feast import Entity, FeatureStore, FeatureView, Field
from feast.infra.offline_stores.file_source import FileSource
from feast.repo_config import RegistryConfig, RepoConfig
from feast.types import Int32, UnixTimestamp

# create dataset
pd.DataFrame(
    [
        {
            "user_id": 1,
            "event_timestamp": datetime(2022, 5, 1),
            "created": datetime(2022, 5, 1),
            "purchases": 3,
            "last_purchase_date": np.NaN,
        },
        {
            "user_id": 2,
            "event_timestamp": datetime(2022, 5, 2),
            "created": datetime(2022, 5, 2),
            "purchases": 1,
            "last_purchase_date": np.NaN,
        },
        {
            "user_id": 3,
            "event_timestamp": datetime(2022, 5, 2),
            "created": datetime(2022, 5, 2),
            "purchases": 0,
            "last_purchase_date": np.NaN,
        },
    ]
).to_parquet("user_stats.parquet")


user = Entity(name="user_id", description="user id")

user_stats_view = FeatureView(
    name="user_stats",
    entities=[user],
    source=FileSource(
        path="user_stats.parquet",
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    ),
    schema=[
        Field(name="purchases", dtype=Int32),
        Field(name="last_purchase_date", dtype=UnixTimestamp),
    ],
)

online_store_path = "online_store.db"
registry_path = "registry.db"

repo = RepoConfig(
    registry=registry_path,
    project="feature_store",
    provider="local",
    offline_store="file",
    use_ssl=True,
    is_secure=True,
    validate=True,
)

fs = FeatureStore(config=repo)

fs.apply([user, user_stats_view])

fs.materialize_incremental(end_date=datetime.utcnow())


entity_rows = [{"user_id": i} for i in range(1, 4)]


feature_df = fs.get_online_features(
    features=[
        "user_stats:purchases",
        "user_stats:last_purchase_date",
    ],
    entity_rows=entity_rows,
).to_df()
print(feature_df)

Note that all the values of the last_purchase_date column have been set to np.NaN to trigger this bug. The reproduction script in #2803 had partial data.

Current Behavior

/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/repo_config.py:207: RuntimeWarning: `entity_key_serialization_version` is either not specified in the feature_store.yaml, or is specified to a value <= 1.This serialization version may cause errors when trying to write fields with the `Long` data type into the online store. Specifying `entity_key_serialization_version` to 2 is recommended for new projects. 
  warnings.warn(
Materializing 1 feature views to 2022-09-07 18:14:33-04:00 into the sqlite online store.

Since the ttl is 0 for feature view user_stats, the start date will be set to 1 year before the current time.
user_stats from 2021-09-08 18:14:33-04:00 to 2022-09-07 18:14:33-04:00:
  0%|                                                                         | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/abhin/src/github.com/Shopify/pano/repro/repro.py", line 71, in <module>
    fs.materialize_incremental(end_date=datetime.utcnow())
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/feature_store.py", line 1323, in materialize_incremental
    provider.materialize_single_feature_view(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/infra/passthrough_provider.py", line 252, in materialize_single_feature_view
    raise e
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/infra/materialization/local_engine.py", line 170, in _materialize_one
    rows_to_write = _convert_arrow_to_proto(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/utils.py", line 206, in _convert_arrow_to_proto
    proto_values_by_column = {
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/utils.py", line 207, in <dictcomp>
    column: python_values_to_proto_values(
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 446, in python_values_to_proto_values
    return _python_value_to_proto_value(value_type, values)
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 392, in _python_value_to_proto_value
    int_timestamps = _python_datetime_to_int_timestamp(values)
  File "/Users/abhin/.pyenv/virtualenvs/pano/3.9.8/lib/python3.9/site-packages/feast/type_map.py", line 324, in _python_datetime_to_int_timestamp
    int_timestamps.append(int(value))
ValueError: cannot convert float NaN to integer

Expected behaviour

Materialization should complete without errors when a timestamp column contains only nulls.

Specifications

  • Version: 0.24.0
  • Platform:
  • Subsystem:

Possible Solution

In type_map.py's _python_datetime_to_int_timestamp, we should add a separate path for np.NaN values. Since type(np.NaN) == float, the current code path ends up calling int(np.NaN), which raises the ValueError shown above. We could detect NaN values explicitly and map them directly to NULL_TIMESTAMP_INT_VALUE.
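A rough sketch of that idea follows. It is not Feast's actual code: the function name `to_int_timestamps` and the constant `NULL_TIMESTAMP_SENTINEL` are hypothetical stand-ins (the real sentinel is `NULL_TIMESTAMP_INT_VALUE` in type_map.py, whose value is not reproduced here), and converting datetimes with `.timestamp()` is an assumption made only to keep the example self-contained.

```python
import math
from datetime import datetime, timezone
from typing import Any, List, Sequence

# Hypothetical placeholder for Feast's NULL_TIMESTAMP_INT_VALUE.
NULL_TIMESTAMP_SENTINEL = -(2**63)


def to_int_timestamps(values: Sequence[Any]) -> List[int]:
    """Convert datetimes/ints to integer timestamps, mapping nulls to a sentinel."""
    int_timestamps: List[int] = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            # np.nan is a plain Python float, so it is caught here instead
            # of reaching int(), which cannot convert NaN.
            int_timestamps.append(NULL_TIMESTAMP_SENTINEL)
        elif isinstance(value, datetime):
            # Assumption: seconds since the epoch; naive datetimes are
            # interpreted in local time by datetime.timestamp().
            int_timestamps.append(int(value.timestamp()))
        else:
            int_timestamps.append(int(value))
    return int_timestamps


# The unguarded conversion fails the same way as in the traceback.
try:
    int(float("nan"))
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer

# With the guard, an all-null column converts cleanly.
print(to_int_timestamps([float("nan"), float("nan"), float("nan")]))
print(to_int_timestamps([datetime(2022, 5, 1, tzinfo=timezone.utc), None]))
```

Checking for NaN before the generic `int(value)` fallback keeps the existing behaviour for datetimes and integers and only changes what happens to values that previously raised.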
