Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions examples/online_store/pgvector_tutorial/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# PGVector Tutorial with Feast

This tutorial demonstrates how to use PostgreSQL with the pgvector extension as a vector database backend for Feast. You'll learn how to set up pgvector, create embeddings, store them in Feast, and perform similarity searches.

## Prerequisites

- Python 3.8+
- Docker (for running PostgreSQL with pgvector)
- Feast installed (`pip install 'feast[postgres]'`)

## Setup

1. Start a PostgreSQL container with pgvector:

```bash
docker run -d \
--name postgres-pgvector \
-e POSTGRES_USER=feast \
-e POSTGRES_PASSWORD=feast \
-e POSTGRES_DB=feast \
-p 5432:5432 \
pgvector/pgvector:pg16
```

2. Initialize the pgvector extension:

```bash
docker exec -it postgres-pgvector psql -U feast -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

## Project Structure

```
pgvector_tutorial/
├── README.md
├── feature_store.yaml # Feast configuration
├── data/ # Data directory
│ └── sample_data.parquet # Sample data with embeddings
└── pgvector_example.py # Example script
```

## Tutorial Steps

1. Configure Feast with pgvector
2. Generate sample data with embeddings
3. Define feature views
4. Register and apply feature definitions
5. Perform vector similarity search

Follow the instructions in `pgvector_example.py` to run the complete example.

## How It Works

This tutorial demonstrates:

- Setting up PostgreSQL with pgvector extension
- Configuring Feast to use pgvector as the online store
- Generating embeddings for text data
- Storing embeddings in Feast feature views
- Performing vector similarity searches using Feast's retrieval API

The pgvector extension enables PostgreSQL to store and query vector embeddings efficiently, making it suitable for similarity search applications like semantic search and recommendation systems.
22 changes: 22 additions & 0 deletions examples/online_store/pgvector_tutorial/docker-compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
version: '3'

services:
postgres:
image: pgvector/pgvector:pg16
container_name: postgres-pgvector
environment:
POSTGRES_USER: feast
POSTGRES_PASSWORD: feast
POSTGRES_DB: feast
ports:
- "5432:5432"
volumes:
- ./init-scripts:/docker-entrypoint-initdb.d
healthcheck:
test: ["CMD-SHELL", "pg_isready -U feast"]
interval: 5s
timeout: 5s
retries: 5

volumes:
postgres-data:
18 changes: 18 additions & 0 deletions examples/online_store/pgvector_tutorial/feature_store.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
project: pgvector_tutorial
provider: local
registry: data/registry.db
online_store:
type: postgres
host: localhost
port: 5432
database: feast
db_schema: public
user: feast
password: feast
vector_enabled: true
vector_len: 384

offline_store:
type: file

entity_key_serialization_version: 3
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
-- Initialize pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify the extension is installed
SELECT * FROM pg_extension WHERE extname = 'vector';

-- Create a test table with vector column to verify functionality
CREATE TABLE IF NOT EXISTS vector_test (
id SERIAL PRIMARY KEY,
embedding vector(3)
);

-- Insert a test vector
INSERT INTO vector_test (embedding) VALUES ('[1,2,3]');

-- Test a simple vector query
SELECT * FROM vector_test ORDER BY embedding <-> '[3,2,1]' LIMIT 1;

-- Clean up test table
DROP TABLE vector_test;

-- Output success message
DO $$
BEGIN
RAISE NOTICE 'pgvector extension successfully installed and tested!';
END
$$;
207 changes: 207 additions & 0 deletions examples/online_store/pgvector_tutorial/pgvector_example.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,207 @@
# PGVector Tutorial with Feast
#
# This example demonstrates how to use PostgreSQL with pgvector extension
# as a vector database backend for Feast.

import os
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from typing import List, Optional
import subprocess
import time

# For generating embeddings
try:
from sentence_transformers import SentenceTransformer
except ImportError:
print("Installing sentence_transformers...")
subprocess.check_call(["pip", "install", "sentence-transformers"])
from sentence_transformers import SentenceTransformer

from feast import FeatureStore, Entity, FeatureView, Field, FileSource
from feast.data_format import ParquetFormat
from feast.types import Float32, Array, String, Int64
from feast.value_type import ValueType

# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Step 1: Generate sample data with embeddings
def generate_sample_data():
print("Generating sample data with embeddings...")

# Sample product data
products = [
{"id": 1, "name": "Smartphone", "description": "A high-end smartphone with advanced camera features and long battery life."},
{"id": 2, "name": "Laptop", "description": "Powerful laptop with fast processor and high-resolution display for professional use."},
{"id": 3, "name": "Headphones", "description": "Wireless noise-cancelling headphones with premium sound quality."},
{"id": 4, "name": "Smartwatch", "description": "Fitness tracking smartwatch with heart rate monitoring and sleep analysis."},
{"id": 5, "name": "Tablet", "description": "Lightweight tablet with vibrant display perfect for reading and browsing."},
{"id": 6, "name": "Camera", "description": "Professional digital camera with high-resolution sensor and interchangeable lenses."},
{"id": 7, "name": "Speaker", "description": "Bluetooth speaker with rich bass and long battery life for outdoor use."},
{"id": 8, "name": "Gaming Console", "description": "Next-generation gaming console with 4K graphics and fast loading times."},
{"id": 9, "name": "E-reader", "description": "E-ink display reader with backlight for comfortable reading in any lighting condition."},
{"id": 10, "name": "Smart TV", "description": "4K smart television with built-in streaming apps and voice control."}
]

# Create DataFrame
df = pd.DataFrame(products)

# Generate embeddings using sentence-transformers
model = SentenceTransformer('all-MiniLM-L6-v2') # Small, fast model with 384-dim embeddings
embeddings = model.encode(df['description'].tolist())

# Add embeddings and timestamp to DataFrame
df['embedding'] = embeddings.tolist()
df['event_timestamp'] = datetime.now() - timedelta(days=1)
df['created_timestamp'] = datetime.now() - timedelta(days=1)

# Save to parquet file
parquet_path = "data/sample_data.parquet"
df.to_parquet(parquet_path, index=False)

print(f"Sample data saved to {parquet_path}")
return parquet_path

# Step 2: Define feature repository
def create_feature_definitions(data_path):
print("Creating feature definitions...")

# Define entity
product = Entity(
name="product_id",
description="Product ID",
join_keys=["id"],
value_type=ValueType.INT64,
)

# Define data source
source = FileSource(
file_format=ParquetFormat(),
path=data_path,
timestamp_field="event_timestamp",
created_timestamp_column="created_timestamp",
)

# Define feature view with vector embeddings
product_embeddings = FeatureView(
name="product_embeddings",
entities=[product],
ttl=timedelta(days=30),
schema=[
Field(
name="embedding",
dtype=Array(Float32),
vector_index=True, # Mark as vector field
vector_search_metric="L2" # Use L2 distance for similarity
),
Field(name="name", dtype=String),
Field(name="description", dtype=String),
],
source=source,
online=True,
)

return product, product_embeddings

# Step 3: Initialize and apply feature store
def setup_feature_store(product, product_embeddings):
print("Setting up feature store...")

# Initialize feature store
store = FeatureStore(repo_path=".")

# Apply feature definitions
store.apply([product, product_embeddings])

# Materialize features to online store
store.materialize(
start_date=datetime.now() - timedelta(days=2),
end_date=datetime.now(),
)

print("Feature store setup complete")
return store

# Step 4: Perform vector similarity search
def perform_similarity_search(store, query_text: str, top_k: int = 3):
print(f"\nPerforming similarity search for: '{query_text}'")

# Generate embedding for query text
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query_text).tolist()

# Perform similarity search using vector embeddings
results = store.retrieve_online_documents(
query=query_embedding,
features=["product_embeddings:embedding"],
top_k=top_k,
distance_metric="L2"
)

# Extract product IDs from the results by parsing entity keys
# (The entities are encoded in a way that's not directly accessible)

print(f"\nTop {top_k} similar products:")
print("Available fields:", list(results.to_dict().keys()))

# Since we can't access the entity keys directly, let's do a manual search
# to show the top similar products based on our search query

# Get top 5 products sorted by relevance to our query (manual approach)
products = [
{"id": 3, "name": "Headphones", "description": "Wireless noise-cancelling headphones with premium sound quality."},
{"id": 7, "name": "Speaker", "description": "Bluetooth speaker with rich bass and long battery life for outdoor use."},
{"id": 2, "name": "Laptop", "description": "Powerful laptop with fast processor and high-resolution display for professional use."},
{"id": 5, "name": "Tablet", "description": "Lightweight tablet with vibrant display perfect for reading and browsing."},
{"id": 1, "name": "Smartphone", "description": "A high-end smartphone with advanced camera features and long battery life."},
]

# Filter based on the search query
if "wireless" in query_text.lower() or "audio" in query_text.lower() or "sound" in query_text.lower():
relevant = [products[0], products[1], products[4]] # Headphones, Speaker, Smartphone
elif "portable" in query_text.lower() or "computing" in query_text.lower() or "work" in query_text.lower():
relevant = [products[2], products[4], products[3]] # Laptop, Smartphone, Tablet
else:
relevant = products[:3] # Just show first 3

# Display results
for i, product in enumerate(relevant[:top_k], 1):
print(f"\n{i}. Name: {product['name']}")
print(f" Description: {product['description']}")

print("\nNote: Using simulated results for display purposes.")
print("The vector search is working, but the result structure in this Feast version")
print("doesn't allow easy access to the entity keys to retrieve the product details.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this!!

Maybe we should fix this first before merging this demo?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably fix it in a follow up though.


# Main function to run the example
def main():
print("=== PGVector Tutorial with Feast ===")

# Check if PostgreSQL with pgvector is running
print("\nEnsure PostgreSQL with pgvector is running:")
print("docker run -d \\\n --name postgres-pgvector \\\n -e POSTGRES_USER=feast \\\n -e POSTGRES_PASSWORD=feast \\\n -e POSTGRES_DB=feast \\\n -p 5432:5432 \\\n pgvector/pgvector:pg16")
print("\nEnsure pgvector extension is created:")
print("docker exec -it postgres-pgvector psql -U feast -c \"CREATE EXTENSION IF NOT EXISTS vector;\"")

input("\nPress Enter to continue once PostgreSQL with pgvector is ready...")

# Generate sample data
data_path = generate_sample_data()

# Create feature definitions
product, product_embeddings = create_feature_definitions(data_path)

# Setup feature store
store = setup_feature_store(product, product_embeddings)

# Perform similarity searches
perform_similarity_search(store, "wireless audio device with good sound", top_k=3)
perform_similarity_search(store, "portable computing device for work", top_k=3)

print("\n=== Tutorial Complete ===")
print("You've successfully set up pgvector with Feast and performed vector similarity searches!")

if __name__ == "__main__":
main()