Closed
Summary
Feast's Milvus online store integration has a critical dimension mismatch bug that affects both the push API and materialization approaches. When storing embeddings with correct dimensions (384), Feast internally transforms the data incorrectly, causing Milvus to reject the data with dimension errors.
Environment
- Feast version: 0.51.0
- Python version: 3.12.11
- pymilvus version: 2.3.0+
- OS: macOS (Darwin 24.5.0)
- Milvus: milvus-lite (via `path: data/online_store.db`)
Bug Description
Error Message
```
ERROR:pymilvus.decorators:RPC error: [upsert_rows], <MilvusException: (code=65535, message=the length(7695) of float data should divide the dim(384): )>
```
Expected Behavior
- Input: 5 embeddings × 384 dimensions = 1920 total elements
- Feast should store these embeddings correctly in Milvus
- Expected elements sent to Milvus: 1920
Actual Behavior
- Input: 5 embeddings × 384 dimensions = 1920 total elements
- Feast transforms this to 7695 elements (factor of ~4x)
- Milvus rejects the data because 7695 ÷ 384 = 20.04... (not integer)
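The arithmetic can be checked directly from the numbers in the error message (384 is the `embedding_dim` from the config below; 7695 is what Milvus reports receiving):

```python
# Quick check of the counts from the Milvus error message.
dim = 384
expected = 5 * dim   # 5 embeddings x 384 floats each
received = 7695      # length Milvus reports receiving

print(expected)                       # 1920
print(received % dim)                 # 15 -> not a multiple of 384, so Milvus rejects the batch
print(round(received / expected, 3))  # 4.008 -> the "~4x" inflation factor
```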
Steps to Reproduce
1. Feature Store Configuration

```yaml
# feast_feature_repo/feature_store.yaml
project: rag
provider: local
registry: data/registry.db
online_store:
  type: milvus
  path: data/online_store.db
  vector_enabled: true
  embedding_dim: 384
  index_type: "FLAT"
  metric_type: "L2"
offline_store:
  type: file
entity_key_serialization_version: 3
auth:
  type: no_auth
```

2. Feature Definitions
```python
from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Array, Float32, String, Int64
from feast.value_type import ValueType
from datetime import timedelta

document = Entity(
    name="document_id",
    value_type=ValueType.STRING,
    description="Unique identifier for document chunks",
)

document_embeddings_source = FileSource(
    name="document_embeddings_source",
    path="data/document_embeddings.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

document_embeddings_push_source = PushSource(
    name="document_embeddings_push_source",
    batch_source=document_embeddings_source,
)

document_embeddings = FeatureView(
    name="document_embeddings",
    entities=[document],
    ttl=timedelta(days=365),
    schema=[
        Field(name="embedding", dtype=Array(Float32), vector_index=True),
        Field(name="chunk_text", dtype=String),
        Field(name="document_title", dtype=String),
        Field(name="chunk_index", dtype=Int64),
        Field(name="file_path", dtype=String),
        Field(name="chunk_length", dtype=Int64),
    ],
    online=True,
    source=document_embeddings_push_source,
    tags={"team": "rag", "version": "v3"},
)
```

3. Reproduce with Push API
```python
import pandas as pd
import numpy as np
from datetime import datetime
from sentence_transformers import SentenceTransformer
from feast import FeatureStore
from feast.data_format import PushMode

# Generate test embeddings (384 dimensions)
model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [
    'Test document 1',
    'Test document 2',
    'Test document 3',
    'Test document 4',
    'Test document 5',
]
embeddings = model.encode(texts)  # Shape: (5, 384)

# Create DataFrame
feature_data = []
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
    feature_data.append({
        "document_id": f"test_doc_{i}",
        "embedding": embedding.tolist(),  # Convert to list as per docs
        "chunk_text": text,
        "document_title": "test_document.md",
        "chunk_index": i,
        "file_path": "test_path",
        "chunk_length": len(text),
        "event_timestamp": pd.Timestamp.now(tz='UTC'),
        "created_timestamp": pd.Timestamp.now(tz='UTC'),
    })
df = pd.DataFrame(feature_data)
print(f"Input data: {len(df)} rows, {len(df) * 384} total elements")

# Initialize Feast store
fs = FeatureStore(repo_path="feast_feature_repo")

# This will fail with dimension mismatch
fs.push(
    push_source_name="document_embeddings_push_source",
    df=df,
    to=PushMode.ONLINE_AND_OFFLINE,
)
```

4. Reproduce with Materialization
```python
# Save to parquet file
df.to_parquet('feast_feature_repo/data/document_embeddings.parquet', index=False)

# Try materialization
from datetime import timedelta
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

# This will also fail with the same dimension mismatch
fs.materialize(
    start_date=start_time,
    end_date=end_time,
    feature_views=["document_embeddings"],
)
```

Investigation Results
Data Validation
Our debugging confirmed:
- ✅ Input embeddings are exactly 384 dimensions each
- ✅ DataFrame contains 5 rows × 384 = 1920 total elements
- ✅ Embeddings converted to Python lists correctly
- ✅ Data types are correct (`Array(Float32)`)
- ❌ Feast somehow transforms 1920 → 7695 elements internally
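The validation above can be scripted as a pre-push check. A minimal sketch (pure Python; `validate_embeddings` is an illustrative helper, not a Feast API, and it works on the `feature_data` list of dicts from the repro or on `df.to_dict('records')`):

```python
def validate_embeddings(rows, key="embedding", dim=384):
    """Assert every row holds a flat list of `dim` floats; return the total element count."""
    total = 0
    for row in rows:
        vec = row[key]
        assert isinstance(vec, list), f"expected list, got {type(vec)}"
        assert len(vec) == dim, f"expected {dim} elements, got {len(vec)}"
        total += len(vec)
    assert total % dim == 0, "total element count is not a multiple of dim"
    return total

# Same shape as the repro data: 5 rows x 384 floats
rows = [{"embedding": [0.1] * 384} for _ in range(5)]
print(validate_embeddings(rows))  # 1920
```

Passing this check before `fs.push()` and still hitting the Milvus error is what isolates the transformation to Feast's internals.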
Affected Methods
- Push API: `store.push()` with `PushMode.ONLINE_AND_OFFLINE`
- Materialization: `store.materialize()` from parquet files
- Both fail with identical dimension mismatch errors
Expected Fix
Feast should correctly handle Array(Float32) fields when:
- Pushing data via push API
- Materializing data from parquet files
- The dimension transformation logic needs debugging/fixing
Potential Root Cause
The issue appears to be in Feast's internal serialization/transformation of Array(Float32) fields when interfacing with Milvus. The ~4x multiplication factor (1920 → 7695) suggests there might be:
- Incorrect flattening of nested arrays
- Multiple serialization passes
- Data type conversion issues in the Milvus online store adapter
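None of these hypotheses is confirmed. As an illustration of the first one (this is not Feast's actual code): any pass that injects extra elements per row while flattening the nested arrays breaks Milvus's divisibility requirement, and even a few stray values per row produce a remainder like the one in the error:

```python
# Hypothetical illustration of "incorrect flattening of nested arrays".
dim = 384
rows = [[0.0] * dim for _ in range(5)]

# Correct flattening: 5 * 384 = 1920, divisible by 384 -> accepted.
correct = [x for row in rows for x in row]
print(len(correct), len(correct) % dim)  # 1920 0

# Buggy flattening that prepends 3 stray values per row:
# 5 * 387 = 1935, remainder 15 -> rejected, same remainder as 7695 % 384.
buggy = [x for row in rows for x in ([0.0, 0.0, 0.0] + row)]
print(len(buggy), len(buggy) % dim)      # 1935 15
```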
Workaround
Currently using a direct `pymilvus.MilvusClient` integration, which works correctly with the same data, confirming the issue is within Feast's Milvus adapter rather than in the input.
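A minimal sketch of that workaround, assuming the `MilvusClient` quick-setup API with milvus-lite (the collection name and row layout here are illustrative, not what Feast creates); the import is guarded so the data-preparation part runs even without pymilvus installed:

```python
# Hedged sketch: write the same 5 x 384 payload to Milvus directly,
# bypassing Feast's Milvus adapter.
try:
    from pymilvus import MilvusClient
except ImportError:
    MilvusClient = None  # pymilvus/milvus-lite not installed

dim = 384
rows = [
    {"id": i, "vector": [0.1] * dim, "chunk_text": f"Test document {i}"}
    for i in range(5)
]
# Exactly the invariant Feast appears to violate internally:
assert all(len(r["vector"]) == dim for r in rows)

if MilvusClient is not None:
    client = MilvusClient("data/online_store.db")  # same milvus-lite file
    client.create_collection(
        collection_name="document_embeddings_direct",
        dimension=dim,
    )
    res = client.insert(collection_name="document_embeddings_direct", data=rows)
    print(res)  # insert accepts the same data that Feast's adapter mangles
```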