Feast Ray RAG Template - Batch Embedding at Scale for RAG with Ray

A Retrieval-Augmented Generation (RAG) template that uses Feast with Ray for distributed processing and Milvus for vector search.

🚀 What This Template Provides

🎬 Sample IMDB Data: 10 curated movies included for quick demos
⚡ Ray Distributed Processing: Parallel embedding generation across workers
🔍 Vector Search: Milvus integration for semantic similarity
🎯 Complete Pipeline: Data → Embeddings → Search in one workflow
📦 Ready to Scale: Easy upgrade to full dataset (48K+ movies) if needed

📁 Template Structure

ray_rag/
├── feature_repo/
│   ├── feature_store.yaml      # Ray + Milvus configuration
│   ├── feature_definitions.py  # Feature definitions with Ray UDF
│   ├── test_workflow.py        # End-to-end demo
│   └── data/                   
│       └── raw_movies.parquet  # Sample IMDB dataset (10 movies)
├── bootstrap.py                # Template initialization
└── README.md

🚦 Quick Start

1. Initialize Template

feast init -t ray_rag my_rag_project
cd my_rag_project/feature_repo

The template includes a sample dataset with 10 movies for quick testing.

2. Install Dependencies

# Core dependencies
# Quote the extras so shells such as zsh do not expand the brackets
pip install "feast[ray]" sentence-transformers

3. Apply Feature Definitions

feast apply

4. Materialize Features

# Generate embeddings for sample movies
feast materialize --disable-event-timestamp

5. Test the Pipeline

python test_workflow.py

Expected output with sample dataset:

✅ 10 embeddings materialized
✅ Vector search working with relevant results
✅ Similarity scores for relevant matches

📊 Architecture

Raw Data (IMDB Parquet)
    ↓
Ray Offline Store (Distributed I/O)
    ↓
Ray Compute Engine (Parallel Embedding Generation)
    ↓  
Milvus Online Store (Vector Search)
    ↓
RAG Application
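
The parallel embedding stage above can be pictured as "split the rows into batches, embed each batch on a worker, reassemble in order". The sketch below illustrates only that pattern. It deliberately uses Python's standard-library thread pool instead of Ray, and `fake_embed` is a hypothetical stand-in for a real embedding model, so none of these names come from the template itself.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(rows, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def fake_embed(batch):
    """Hypothetical stand-in for a model: maps each text to a 2-d vector
    (character count, space count)."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


def embed_in_parallel(texts, batch_size=4, workers=2):
    """Embed batches concurrently and return vectors in input order."""
    batches = chunk(texts, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in batch order, so vectors line up with texts
        results = list(pool.map(fake_embed, batches))
    return [vector for batch in results for vector in batch]


# Made-up descriptions, used only for illustration
descriptions = [
    "A crew travels through a wormhole",
    "A heist inside dreams",
    "Toys come alive",
]
vectors = embed_in_parallel(descriptions, batch_size=2)
print(len(vectors))
```

In the template, Ray takes the role of the worker pool and a sentence-transformers model takes the role of `fake_embed`. The batching and order-preserving reassembly are the part this sketch is meant to show.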

🎬 Example Workflow

from datetime import datetime, timezone

from feast import FeatureStore
from sentence_transformers import SentenceTransformer

# 1. Initialize
store = FeatureStore(repo_path=".")

# 2. Materialize up to the current time (embeddings generated in parallel)
end_date = datetime.now(timezone.utc)
store.materialize_incremental(end_date)

# 3. Search using Feast API
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_embedding = model.encode(["sci-fi movie about space"])[0].tolist()

results = store.retrieve_online_documents_v2(
    features=[
        "document_embeddings:embedding",
        "document_embeddings:movie_name",
        "document_embeddings:movie_director",
    ],
    query=query_embedding,
    top_k=5,
).to_dict()

# Display results with metadata
for i in range(len(results["document_id_pk"])):
    print(f"{i+1}. {results['movie_name'][i]}")
    print(f"   Director: {results['movie_director'][i]}")
    print(f"   Distance: {results['distance'][i]:.3f}")
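
To give some intuition for the `Distance` values printed above, here is a minimal pure-Python sketch of a top-k search by cosine distance. It is an illustration under assumptions: the metric Milvus actually applies depends on how the online store is configured, and the titles and 2-d vectors below are made up rather than taken from the dataset.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, documents, k):
    """Return the k (name, distance) pairs closest to the query vector."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy embeddings (real ones have hundreds of dimensions)
documents = {
    "Space Movie": [1.0, 0.0],
    "Romance": [0.0, 1.0],
    "Space Drama": [1.0, 1.0],
}

for name, distance in top_k([1.0, 0.0], documents, k=2):
    print(f"{name}: {distance:.3f}")
```

Under this metric a smaller distance means a closer match. A vector database does the same ranking with an index rather than a full scan.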

📥 Using the Full IMDB Dataset (Optional)

The template includes a small sample dataset (10 movies) for quick testing. To work with the full dataset containing 48K+ movies:

Option 1: Download via Kaggle API

Setup Kaggle credentials:

# Get API credentials from https://www.kaggle.com/account
# Place kaggle.json in ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Install Kaggle API and download dataset:

pip install kaggle

# Download to your feature_repo/data directory
cd feature_repo
kaggle datasets download -d yashgupta24/48000-movies-dataset -p ./data --unzip

Convert to parquet format:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

# Read the first CSV file found in ./data (filename may vary)
data_path = Path("./data")
csv_files = sorted(data_path.glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {data_path.resolve()}")
df = pd.read_csv(csv_files[0])

# Convert DatePublished to UTC datetimes, then drop rows that are missing or unparseable
df["DatePublished"] = pd.to_datetime(df["DatePublished"], errors="coerce", utc=True)
df = df.dropna(subset=["DatePublished"])

# Write to parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, data_path / "raw_movies.parquet")
print(f"✅ Converted {len(df)} movies to parquet format")

Run the full pipeline:

feast apply
feast materialize --disable-event-timestamp
python test_workflow.py

Option 2: Use Your Own Dataset

Replace feature_repo/data/raw_movies.parquet with your own dataset. Required schema:

id: Unique identifier (string)
Name: Movie name (string)
Description: Movie description for embedding (string)
Director: Director name (string)
Genres: Comma-separated genres (string)
RatingValue: Rating score (float)
DatePublished: Publication date (datetime with UTC timezone)

The Ray embedding pipeline will automatically process your dataset in parallel.
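
Before swapping in your own file, it can help to check it against the schema listed above. The following is a small sketch using only the standard library. The helper names `missing_columns` and `is_utc` are hypothetical and not part of Feast or this template.

```python
from datetime import datetime, timedelta, timezone

# Column names copied from the required schema above
REQUIRED_COLUMNS = [
    "id",
    "Name",
    "Description",
    "Director",
    "Genres",
    "RatingValue",
    "DatePublished",
]


def missing_columns(columns):
    """Return the required column names that are absent, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def is_utc(value):
    """True if the datetime is timezone-aware with a zero UTC offset."""
    return value.tzinfo is not None and value.utcoffset() == timedelta(0)


print(missing_columns(["id", "Name", "Description"]))
print(is_utc(datetime(2020, 1, 1, tzinfo=timezone.utc)))
```

With pandas, the same check could be run as `missing_columns(df.columns)` after loading the parquet file, and `is_utc` applied to a sample value from `DatePublished`.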
