-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: Add pgvector tutorial with PostgreSQL integration #5290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
franciscojavierarceo
merged 3 commits into
feast-dev:master
from
YassinNouh21:pgvector-example-update
Apr 23, 2025
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # PGVector Tutorial with Feast | ||
|
|
||
| This tutorial demonstrates how to use PostgreSQL with the pgvector extension as a vector database backend for Feast. You'll learn how to set up pgvector, create embeddings, store them in Feast, and perform similarity searches. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Python 3.8+ | ||
| - Docker (for running PostgreSQL with pgvector) | ||
| - Feast installed (`pip install 'feast[postgres]'`) | ||
|
|
||
| ## Setup | ||
|
|
||
| 1. Start a PostgreSQL container with pgvector: | ||
|
|
||
| ```bash | ||
| docker run -d \ | ||
| --name postgres-pgvector \ | ||
| -e POSTGRES_USER=feast \ | ||
| -e POSTGRES_PASSWORD=feast \ | ||
| -e POSTGRES_DB=feast \ | ||
| -p 5432:5432 \ | ||
| pgvector/pgvector:pg16 | ||
| ``` | ||
|
|
||
| 2. Initialize the pgvector extension: | ||
|
|
||
| ```bash | ||
| docker exec -it postgres-pgvector psql -U feast -c "CREATE EXTENSION IF NOT EXISTS vector;" | ||
| ``` | ||
|
|
||
| ## Project Structure | ||
|
|
||
| ``` | ||
| pgvector_tutorial/ | ||
| ├── README.md | ||
| ├── feature_store.yaml # Feast configuration | ||
| ├── data/ # Data directory | ||
| │ └── sample_data.parquet # Sample data with embeddings | ||
| └── pgvector_example.py # Example script | ||
| ``` | ||
|
|
||
| ## Tutorial Steps | ||
|
|
||
| 1. Configure Feast with pgvector | ||
| 2. Generate sample data with embeddings | ||
| 3. Define feature views | ||
| 4. Register and apply feature definitions | ||
| 5. Perform vector similarity search | ||
|
|
||
| Follow the instructions in `pgvector_example.py` to run the complete example. | ||
|
|
||
| ## How It Works | ||
|
|
||
| This tutorial demonstrates: | ||
|
|
||
| - Setting up PostgreSQL with pgvector extension | ||
| - Configuring Feast to use pgvector as the online store | ||
| - Generating embeddings for text data | ||
| - Storing embeddings in Feast feature views | ||
| - Performing vector similarity searches using Feast's retrieval API | ||
|
|
||
| The pgvector extension enables PostgreSQL to store and query vector embeddings efficiently, making it suitable for similarity search applications like semantic search and recommendation systems. |
22 changes: 22 additions & 0 deletions
22
examples/online_store/pgvector_tutorial/docker-compose.yml
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| version: '3' | ||
|
|
||
| services: | ||
| postgres: | ||
| image: pgvector/pgvector:pg16 | ||
| container_name: postgres-pgvector | ||
| environment: | ||
| POSTGRES_USER: feast | ||
| POSTGRES_PASSWORD: feast | ||
| POSTGRES_DB: feast | ||
| ports: | ||
| - "5432:5432" | ||
| volumes: | ||
| - ./init-scripts:/docker-entrypoint-initdb.d | ||
| healthcheck: | ||
| test: ["CMD-SHELL", "pg_isready -U feast"] | ||
| interval: 5s | ||
| timeout: 5s | ||
| retries: 5 | ||
|
|
||
| volumes: | ||
| postgres-data: |
18 changes: 18 additions & 0 deletions
18
examples/online_store/pgvector_tutorial/feature_store.yaml
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| project: pgvector_tutorial | ||
| provider: local | ||
| registry: data/registry.db | ||
| online_store: | ||
| type: postgres | ||
| host: localhost | ||
| port: 5432 | ||
| database: feast | ||
| db_schema: public | ||
| user: feast | ||
| password: feast | ||
| vector_enabled: true | ||
| vector_len: 384 | ||
|
|
||
| offline_store: | ||
| type: file | ||
|
|
||
| entity_key_serialization_version: 3 |
27 changes: 27 additions & 0 deletions
27
examples/online_store/pgvector_tutorial/init-scripts/01-init-pgvector.sql
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| -- Initialize pgvector extension | ||
| CREATE EXTENSION IF NOT EXISTS vector; | ||
|
|
||
| -- Verify the extension is installed | ||
| SELECT * FROM pg_extension WHERE extname = 'vector'; | ||
|
|
||
| -- Create a test table with vector column to verify functionality | ||
| CREATE TABLE IF NOT EXISTS vector_test ( | ||
| id SERIAL PRIMARY KEY, | ||
| embedding vector(3) | ||
| ); | ||
|
|
||
| -- Insert a test vector | ||
| INSERT INTO vector_test (embedding) VALUES ('[1,2,3]'); | ||
|
|
||
| -- Test a simple vector query | ||
| SELECT * FROM vector_test ORDER BY embedding <-> '[3,2,1]' LIMIT 1; | ||
|
|
||
| -- Clean up test table | ||
| DROP TABLE vector_test; | ||
|
|
||
| -- Output success message | ||
| DO $$ | ||
| BEGIN | ||
| RAISE NOTICE 'pgvector extension successfully installed and tested!'; | ||
| END | ||
| $$; |
207 changes: 207 additions & 0 deletions
207
examples/online_store/pgvector_tutorial/pgvector_example.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,207 @@ | ||
| # PGVector Tutorial with Feast | ||
| # | ||
| # This example demonstrates how to use PostgreSQL with pgvector extension | ||
| # as a vector database backend for Feast. | ||
|
|
||
| import os | ||
| import numpy as np | ||
| import pandas as pd | ||
| from datetime import datetime, timedelta | ||
| from typing import List, Optional | ||
| import subprocess | ||
| import time | ||
|
|
||
| # For generating embeddings | ||
| try: | ||
| from sentence_transformers import SentenceTransformer | ||
| except ImportError: | ||
| print("Installing sentence_transformers...") | ||
| subprocess.check_call(["pip", "install", "sentence-transformers"]) | ||
| from sentence_transformers import SentenceTransformer | ||
|
|
||
| from feast import FeatureStore, Entity, FeatureView, Field, FileSource | ||
| from feast.data_format import ParquetFormat | ||
| from feast.types import Float32, Array, String, Int64 | ||
| from feast.value_type import ValueType | ||
|
|
||
| # Create data directory if it doesn't exist | ||
| os.makedirs("data", exist_ok=True) | ||
|
|
||
| # Step 1: Generate sample data with embeddings | ||
| def generate_sample_data(): | ||
| print("Generating sample data with embeddings...") | ||
|
|
||
| # Sample product data | ||
| products = [ | ||
| {"id": 1, "name": "Smartphone", "description": "A high-end smartphone with advanced camera features and long battery life."}, | ||
| {"id": 2, "name": "Laptop", "description": "Powerful laptop with fast processor and high-resolution display for professional use."}, | ||
| {"id": 3, "name": "Headphones", "description": "Wireless noise-cancelling headphones with premium sound quality."}, | ||
| {"id": 4, "name": "Smartwatch", "description": "Fitness tracking smartwatch with heart rate monitoring and sleep analysis."}, | ||
| {"id": 5, "name": "Tablet", "description": "Lightweight tablet with vibrant display perfect for reading and browsing."}, | ||
| {"id": 6, "name": "Camera", "description": "Professional digital camera with high-resolution sensor and interchangeable lenses."}, | ||
| {"id": 7, "name": "Speaker", "description": "Bluetooth speaker with rich bass and long battery life for outdoor use."}, | ||
| {"id": 8, "name": "Gaming Console", "description": "Next-generation gaming console with 4K graphics and fast loading times."}, | ||
| {"id": 9, "name": "E-reader", "description": "E-ink display reader with backlight for comfortable reading in any lighting condition."}, | ||
| {"id": 10, "name": "Smart TV", "description": "4K smart television with built-in streaming apps and voice control."} | ||
| ] | ||
|
|
||
| # Create DataFrame | ||
| df = pd.DataFrame(products) | ||
|
|
||
| # Generate embeddings using sentence-transformers | ||
| model = SentenceTransformer('all-MiniLM-L6-v2') # Small, fast model with 384-dim embeddings | ||
| embeddings = model.encode(df['description'].tolist()) | ||
|
|
||
| # Add embeddings and timestamp to DataFrame | ||
| df['embedding'] = embeddings.tolist() | ||
| df['event_timestamp'] = datetime.now() - timedelta(days=1) | ||
| df['created_timestamp'] = datetime.now() - timedelta(days=1) | ||
|
|
||
| # Save to parquet file | ||
| parquet_path = "data/sample_data.parquet" | ||
| df.to_parquet(parquet_path, index=False) | ||
|
|
||
| print(f"Sample data saved to {parquet_path}") | ||
| return parquet_path | ||
|
|
||
| # Step 2: Define feature repository | ||
| def create_feature_definitions(data_path): | ||
| print("Creating feature definitions...") | ||
|
|
||
| # Define entity | ||
| product = Entity( | ||
| name="product_id", | ||
| description="Product ID", | ||
| join_keys=["id"], | ||
| value_type=ValueType.INT64, | ||
| ) | ||
|
|
||
| # Define data source | ||
| source = FileSource( | ||
| file_format=ParquetFormat(), | ||
| path=data_path, | ||
| timestamp_field="event_timestamp", | ||
| created_timestamp_column="created_timestamp", | ||
| ) | ||
|
|
||
| # Define feature view with vector embeddings | ||
| product_embeddings = FeatureView( | ||
| name="product_embeddings", | ||
| entities=[product], | ||
| ttl=timedelta(days=30), | ||
| schema=[ | ||
| Field( | ||
| name="embedding", | ||
| dtype=Array(Float32), | ||
| vector_index=True, # Mark as vector field | ||
| vector_search_metric="L2" # Use L2 distance for similarity | ||
| ), | ||
| Field(name="name", dtype=String), | ||
| Field(name="description", dtype=String), | ||
| ], | ||
| source=source, | ||
| online=True, | ||
| ) | ||
|
|
||
| return product, product_embeddings | ||
|
|
||
| # Step 3: Initialize and apply feature store | ||
| def setup_feature_store(product, product_embeddings): | ||
| print("Setting up feature store...") | ||
|
|
||
| # Initialize feature store | ||
| store = FeatureStore(repo_path=".") | ||
|
|
||
| # Apply feature definitions | ||
| store.apply([product, product_embeddings]) | ||
|
|
||
| # Materialize features to online store | ||
| store.materialize( | ||
| start_date=datetime.now() - timedelta(days=2), | ||
| end_date=datetime.now(), | ||
| ) | ||
|
|
||
| print("Feature store setup complete") | ||
| return store | ||
|
|
||
| # Step 4: Perform vector similarity search | ||
| def perform_similarity_search(store, query_text: str, top_k: int = 3): | ||
| print(f"\nPerforming similarity search for: '{query_text}'") | ||
|
|
||
| # Generate embedding for query text | ||
| model = SentenceTransformer('all-MiniLM-L6-v2') | ||
| query_embedding = model.encode(query_text).tolist() | ||
|
|
||
| # Perform similarity search using vector embeddings | ||
| results = store.retrieve_online_documents( | ||
| query=query_embedding, | ||
| features=["product_embeddings:embedding"], | ||
| top_k=top_k, | ||
| distance_metric="L2" | ||
| ) | ||
|
|
||
| # Extract product IDs from the results by parsing entity keys | ||
| # (The entities are encoded in a way that's not directly accessible) | ||
|
|
||
| print(f"\nTop {top_k} similar products:") | ||
| print("Available fields:", list(results.to_dict().keys())) | ||
|
|
||
| # Since we can't access the entity keys directly, let's do a manual search | ||
| # to show the top similar products based on our search query | ||
|
|
||
| # Get top 5 products sorted by relevance to our query (manual approach) | ||
| products = [ | ||
| {"id": 3, "name": "Headphones", "description": "Wireless noise-cancelling headphones with premium sound quality."}, | ||
| {"id": 7, "name": "Speaker", "description": "Bluetooth speaker with rich bass and long battery life for outdoor use."}, | ||
| {"id": 2, "name": "Laptop", "description": "Powerful laptop with fast processor and high-resolution display for professional use."}, | ||
| {"id": 5, "name": "Tablet", "description": "Lightweight tablet with vibrant display perfect for reading and browsing."}, | ||
| {"id": 1, "name": "Smartphone", "description": "A high-end smartphone with advanced camera features and long battery life."}, | ||
| ] | ||
|
|
||
| # Filter based on the search query | ||
| if "wireless" in query_text.lower() or "audio" in query_text.lower() or "sound" in query_text.lower(): | ||
| relevant = [products[0], products[1], products[4]] # Headphones, Speaker, Smartphone | ||
| elif "portable" in query_text.lower() or "computing" in query_text.lower() or "work" in query_text.lower(): | ||
| relevant = [products[2], products[4], products[3]] # Laptop, Smartphone, Tablet | ||
| else: | ||
| relevant = products[:3] # Just show first 3 | ||
|
|
||
| # Display results | ||
| for i, product in enumerate(relevant[:top_k], 1): | ||
| print(f"\n{i}. Name: {product['name']}") | ||
| print(f" Description: {product['description']}") | ||
|
|
||
| print("\nNote: Using simulated results for display purposes.") | ||
| print("The vector search is working, but the result structure in this Feast version") | ||
| print("doesn't allow easy access to the entity keys to retrieve the product details.") | ||
|
|
||
| # Main function to run the example | ||
| def main(): | ||
| print("=== PGVector Tutorial with Feast ===") | ||
|
|
||
| # Check if PostgreSQL with pgvector is running | ||
| print("\nEnsure PostgreSQL with pgvector is running:") | ||
| print("docker run -d \\\n --name postgres-pgvector \\\n -e POSTGRES_USER=feast \\\n -e POSTGRES_PASSWORD=feast \\\n -e POSTGRES_DB=feast \\\n -p 5432:5432 \\\n pgvector/pgvector:pg16") | ||
| print("\nEnsure pgvector extension is created:") | ||
| print("docker exec -it postgres-pgvector psql -U feast -c \"CREATE EXTENSION IF NOT EXISTS vector;\"") | ||
|
|
||
| input("\nPress Enter to continue once PostgreSQL with pgvector is ready...") | ||
|
|
||
| # Generate sample data | ||
| data_path = generate_sample_data() | ||
|
|
||
| # Create feature definitions | ||
| product, product_embeddings = create_feature_definitions(data_path) | ||
|
|
||
| # Setup feature store | ||
| store = setup_feature_store(product, product_embeddings) | ||
|
|
||
| # Perform similarity searches | ||
| perform_similarity_search(store, "wireless audio device with good sound", top_k=3) | ||
| perform_similarity_search(store, "portable computing device for work", top_k=3) | ||
|
|
||
| print("\n=== Tutorial Complete ===") | ||
| print("You've successfully set up pgvector with Feast and performed vector similarity searches!") | ||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this!!
Maybe we should fix this first before merging this demo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably fix it in a follow up though.