MongoDB source (contrib)

Description

MongoDB data sources are MongoDB collections that can be used as a source for feature data. The MongoDBSource points at a MongoDB collection and provides the metadata Feast needs to read historical features from the offline store's collection.

Examples

Defining a MongoDB source:

from feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb import (
    MongoDBSource,
)

driver_stats_source = MongoDBSource(
    name="driver_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_at",
)

The name field becomes the feature_view discriminator stored in every document in the feature_history collection.

Configuration options such as connection_string, database, and collection are inherited from the offline store configuration in feature_store.yaml.

The full set of configuration options is available here.

Vector Search

The MongoDB online store supports MongoDB Vector Search, enabling similarity search over feature embeddings stored in MongoDB. This is powered by the $vectorSearch aggregation stage and supports MongoDB Atlas, self-hosted MongoDB with Atlas Search indexes, and the mongodb/mongodb-atlas-local Docker image for local development.

Configuration

Enable vector search in your feature_store.yaml:

project: my_project
provider: local
online_store:
  type: mongodb
  connection_string: mongodb+srv://<user>:<pass>@cluster.mongodb.net  # pragma: allowlist secret
  vector_enabled: true
  similarity: cosine  # cosine | euclidean | dotProduct
  vector_index_wait_timeout: 60  # seconds to wait for index to become queryable
  vector_index_wait_poll_interval: 1.0  # seconds between polls
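
The two wait settings describe a poll-until-ready pattern: check whether the index is queryable, sleep for the poll interval, and give up once the timeout has elapsed. The following is a minimal sketch of that pattern, not the store's actual implementation. `wait_until_queryable` and the `is_ready` callable are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable


def wait_until_queryable(
    is_ready: Callable[[], bool],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    `is_ready` is a hypothetical callable standing in for whatever check
    reports that the vector search index can serve queries.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            # Timed out without the index becoming queryable.
            return False
        time.sleep(poll_interval)
```

With the defaults shown in the configuration above, such a loop would check roughly once per second for up to 60 seconds.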

Defining a Feature View with Vector Index

Mark embedding fields with vector_index=True and specify vector_length:

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Array, Float32, Int64, String

item_embeddings = FeatureView(
    name="item_embeddings",
    entities=[Entity(name="item_id", join_keys=["item_id"])],
    schema=[
        Field(
            name="embedding",
            dtype=Array(Float32),
            vector_index=True,
            vector_length=384,
            vector_search_metric="cosine",
        ),
        Field(name="title", dtype=String),
        Field(name="item_id", dtype=Int64),
    ],
    source=FileSource(path="items.parquet", timestamp_field="event_timestamp"),
    ttl=timedelta(hours=24),
)

When feast apply (or store.update()) runs with vector_enabled=True, MongoDB vector search indexes are automatically created for any field with vector_index=True. Indexes are also automatically dropped when feature views are removed.
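
To make the index creation step more concrete, here is a rough sketch of what a vector search index definition for the `embedding` field could look like. The `fields` / `type` / `path` / `numDimensions` / `similarity` layout follows MongoDB's documented vector search index format, and the index name follows the `<feature_view>__<field>__vs_index` pattern described under How It Works. The helper names are hypothetical, and the assumption that the indexed path equals the field name may not match how the store lays out its documents.

```python
from typing import Any, Dict

ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def vector_index_name(feature_view: str, field: str) -> str:
    """Build an index name in the <feature_view>__<field>__vs_index form."""
    return f"{feature_view}__{field}__vs_index"


def vector_index_definition(
    path: str, num_dimensions: int, similarity: str
) -> Dict[str, Any]:
    """Return a MongoDB vector search index definition as a plain dict.

    `path` is assumed (for illustration only) to be the document path
    where the embedding is stored.
    """
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


name = vector_index_name("item_embeddings", "embedding")
definition = vector_index_definition("embedding", 384, "cosine")
```

Here `numDimensions` corresponds to `vector_length` on the Field, and `similarity` to the configured metric.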

Retrieving Documents via Vector Search

Use retrieve_online_documents_v2() to perform similarity search:

from feast import FeatureStore

store = FeatureStore(repo_path=".")
results = store.retrieve_online_documents_v2(
    features=["item_embeddings:embedding", "item_embeddings:title"],
    query=[0.1, 0.2, ...],  # query vector (elided); its length must match vector_length
    top_k=5,
)

How It Works

  • Index creation: update() creates a MongoDB vector search index named <feature_view>__<field>__vs_index for each vector-indexed field. It waits for the index to reach READY status before proceeding.
  • Query execution: retrieve_online_documents_v2() builds a $vectorSearch aggregation pipeline with numCandidates = max(top_k * 10, 100) and the specified limit.
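  • Score: Results include a distance field populated from $meta: "vectorSearchScore". Despite the field name, this value is MongoDB's similarity score, so higher values indicate closer matches.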
  • BSON compatibility: Query vectors are coerced to native Python floats to avoid numpy serialization issues.
  • Idempotency: Calling update() multiple times will not duplicate indexes.
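
The query execution, score, and BSON compatibility points above can be sketched as a small pipeline builder. This is an illustrative approximation rather than the store's actual code: `build_vector_search_pipeline` and its parameters are hypothetical names, and the `$project` stage is one plausible way to expose the score as a `distance` field.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    index_name: str,
    path: str,
    query: Sequence[float],
    top_k: int,
    projected_fields: Sequence[str] = (),
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch aggregation pipeline as plain dicts."""
    # Coerce to native Python floats so values such as numpy.float32
    # do not cause BSON serialization errors.
    query_vector = [float(x) for x in query]
    # Candidate pool size, as described above.
    num_candidates = max(top_k * 10, 100)

    projection: Dict[str, Any] = {name: 1 for name in projected_fields}
    # Expose the search score under the "distance" key.
    projection["distance"] = {"$meta": "vectorSearchScore"}

    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": top_k,
            }
        },
        {"$project": projection},
    ]


pipeline = build_vector_search_pipeline(
    index_name="item_embeddings__embedding__vs_index",
    path="embedding",  # assumed document path, for illustration only
    query=[1, 2, 3],
    top_k=5,
    projected_fields=["title"],
)
```

For `top_k=5` the candidate pool is 100, because `5 * 10` falls below the floor. The multiplier only takes effect once `top_k` exceeds 10.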

Supported Types

MongoDB data sources support all eight primitive types (bytes, string, int32, int64, float32, float64, bool, timestamp) and their corresponding array types. Complex types such as Map and Struct are preserved through the MongoDB document model. For a comparison against other batch data sources, please see here.