Milvus Online Store Dimension Mismatch Error in Push API and Materialization #5551

Summary

Feast's Milvus online store integration has a critical dimension mismatch bug that affects both the push API and materialization paths. When embeddings with the correct dimension (384) are stored, Feast internally transforms the data incorrectly, and Milvus rejects the upsert with a dimension error.

Environment

  • Feast version: 0.51.0
  • Python version: 3.12.11
  • pymilvus version: 2.3.0+
  • OS: macOS (Darwin 24.5.0)
  • Milvus: milvus-lite (via path: data/online_store.db)

Bug Description

Error Message

ERROR:pymilvus.decorators:RPC error: [upsert_rows], <MilvusException: (code=65535, message=the length(7695) of float data should divide the dim(384): )>

Expected Behavior

  • Input: 5 embeddings × 384 dimensions = 1920 total elements
  • Feast should store these embeddings correctly in Milvus
  • Expected elements sent to Milvus: 1920

Actual Behavior

  • Input: 5 embeddings × 384 dimensions = 1920 total elements
  • Feast transforms this to 7695 elements (factor of ~4x)
  • Milvus rejects the data because 7695 ÷ 384 = 20.04... (not integer)
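
For reference, the arithmetic behind the rejection can be checked directly; this is a standalone sketch using only the numbers quoted above:

# Standalone check of the counts reported above and in the Milvus error.
num_rows, dim = 5, 384

expected_total = num_rows * dim           # 1920 elements actually provided
reported_total = 7695                     # length reported by Milvus at upsert time

print(expected_total % dim)               # 0  -> 1920 splits cleanly into 384-dim vectors
print(reported_total % dim)               # 15 -> 7695 is not a multiple of 384, hence the rejection
print(reported_total / expected_total)    # ~4.008 -> the "~4x" inflation factor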

Steps to Reproduce

1. Feature Store Configuration

# feast_feature_repo/feature_store.yaml
project: rag
provider: local
registry: data/registry.db
online_store:
  type: milvus
  path: data/online_store.db
  vector_enabled: true
  embedding_dim: 384
  index_type: "FLAT"
  metric_type: "L2"
offline_store:
  type: file
entity_key_serialization_version: 3
auth:
  type: no_auth

2. Feature Definitions

from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Array, Float32, String, Int64
from feast.value_type import ValueType
from datetime import timedelta

document = Entity(
    name="document_id",
    value_type=ValueType.STRING,
    description="Unique identifier for document chunks"
)

document_embeddings_source = FileSource(
    name="document_embeddings_source",
    path="data/document_embeddings.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

document_embeddings_push_source = PushSource(
    name="document_embeddings_push_source",
    batch_source=document_embeddings_source,
)

document_embeddings = FeatureView(
    name="document_embeddings",
    entities=[document],
    ttl=timedelta(days=365),
    schema=[
        Field(name="embedding", dtype=Array(Float32), vector_index=True),
        Field(name="chunk_text", dtype=String),
        Field(name="document_title", dtype=String),
        Field(name="chunk_index", dtype=Int64),
        Field(name="file_path", dtype=String),
        Field(name="chunk_length", dtype=Int64),
    ],
    online=True,
    source=document_embeddings_push_source,
    tags={"team": "rag", "version": "v3"},
)
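
Before pushing, the entity and feature view above need to be registered (normally via feast apply inside feast_feature_repo/). A programmatic equivalent, assuming the definitions above are importable in the same session:

from feast import FeatureStore

# Register the entity and feature view defined above (equivalent to `feast apply`).
fs = FeatureStore(repo_path="feast_feature_repo")
fs.apply([document, document_embeddings])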

3. Reproduce with Push API

import pandas as pd
import numpy as np
from datetime import datetime
from sentence_transformers import SentenceTransformer
from feast import FeatureStore
from feast.data_source import PushMode

# Generate test embeddings (384 dimensions)
model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [
    'Test document 1',
    'Test document 2', 
    'Test document 3',
    'Test document 4',
    'Test document 5'
]
embeddings = model.encode(texts)  # Shape: (5, 384)

# Create DataFrame
feature_data = []
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
    feature_data.append({
        "document_id": f"test_doc_{i}",
        "embedding": embedding.tolist(),  # Convert to list as per docs
        "chunk_text": text,
        "document_title": "test_document.md",
        "chunk_index": i,
        "file_path": "test_path",
        "chunk_length": len(text),
        "event_timestamp": pd.Timestamp.now(tz='UTC'),
        "created_timestamp": pd.Timestamp.now(tz='UTC')
    })

df = pd.DataFrame(feature_data)
print(f"Input data: {len(df)} rows, {len(df) * 384} total elements")

# Initialize Feast store
fs = FeatureStore(repo_path="feast_feature_repo")

# This will fail with dimension mismatch
fs.push(
    push_source_name="document_embeddings_push_source",
    df=df,
    to=PushMode.ONLINE_AND_OFFLINE
)

4. Reproduce with Materialization

# Save to parquet file
df.to_parquet('feast_feature_repo/data/document_embeddings.parquet', index=False)

# Try materialization
from datetime import timedelta
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

# This will also fail with same dimension mismatch
fs.materialize(
    start_date=start_time,
    end_date=end_time,
    feature_views=["document_embeddings"]
)

Investigation Results

Data Validation

Our debugging confirmed:

  • ✅ Input embeddings are exactly 384 dimensions each
  • ✅ DataFrame contains 5 rows × 384 = 1920 total elements
  • ✅ Embeddings converted to Python lists correctly
  • ✅ Data types are correct (Array(Float32))
  • ❌ Feast somehow transforms 1920 → 7695 elements internally
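
For completeness, a sketch of the kind of checks used for the validation above, reusing feature_data from the reproduction script (nothing here touches Feast internals):

# Every embedding is exactly 384-dimensional.
assert all(len(row["embedding"]) == 384 for row in feature_data)

# 5 rows x 384 dims = 1920 elements in total.
total_elements = sum(len(row["embedding"]) for row in feature_data)
assert total_elements == 1920

# After .tolist(), values are plain Python floats, as expected for Array(Float32).
assert all(isinstance(v, float) for v in feature_data[0]["embedding"])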

Affected Methods

  1. Push API: store.push() with PushMode.ONLINE_AND_OFFLINE
  2. Materialization: store.materialize() from parquet files

Both fail with the identical dimension mismatch error.

Expected Fix

Feast should correctly handle Array(Float32) fields when:

  1. Pushing data via the push API
  2. Materializing data from parquet files

The dimension transformation logic in the Milvus online store path needs debugging/fixing; an illustrative sketch of the expected per-row shape follows.
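
As an illustration only (not Feast's actual adapter code), the shape that should reach the Milvus upsert can be sketched as follows; rows_for_milvus is a hypothetical helper:

from typing import Any, Dict, List

import pandas as pd

def rows_for_milvus(df: pd.DataFrame, dim: int = 384) -> List[Dict[str, Any]]:
    """Hypothetical helper: one dict per row, each carrying its own dim-length vector."""
    rows = []
    for record in df.to_dict(orient="records"):
        embedding = [float(x) for x in record["embedding"]]
        # Vectors stay per row and are never flattened across rows, so Milvus
        # sees len(embedding) == dim for every upserted row.
        assert len(embedding) == dim
        rows.append({
            "document_id": record["document_id"],
            "embedding": embedding,
            "chunk_text": record["chunk_text"],
        })
    return rows

# With the 5-row DataFrame above: 5 rows x 384 floats = 1920 elements in total.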

Potential Root Cause

The issue appears to be in Feast's internal serialization/transformation of Array(Float32) fields when interfacing with Milvus. The ~4x multiplication factor (1920 → 7695) suggests there might be:

  • Incorrect flattening of nested arrays
  • Multiple serialization passes
  • Data type conversion issues in the Milvus online store adapter

Workaround

We are currently using a direct pymilvus.MilvusClient integration, which works correctly with the same data; this confirms that the issue lies in Feast's Milvus adapter.
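
A minimal sketch of that direct workaround, assuming pymilvus's standard MilvusClient quickstart API and reusing texts/embeddings from the reproduction script (the collection name and the auto-generated id/vector schema are illustrative, not what Feast creates):

from pymilvus import MilvusClient

# Talk to milvus-lite directly, bypassing Feast's online store adapter.
client = MilvusClient("feast_feature_repo/data/direct_store.db")

client.create_collection(
    collection_name="document_embeddings_direct",
    dimension=384,  # matches the embedding dimension exactly
)

# One dict per row; each "vector" is a single 384-element list.
rows = [
    {"id": i, "vector": embedding.tolist(), "chunk_text": text}
    for i, (text, embedding) in enumerate(zip(texts, embeddings))
]
client.insert(collection_name="document_embeddings_direct", data=rows)

# Sanity check: nearest neighbours for the first embedding.
hits = client.search(
    collection_name="document_embeddings_direct",
    data=[embeddings[0].tolist()],
    limit=3,
    output_fields=["chunk_text"],
)
print(hits)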
