I have a Postgres instance with the pgvector extension enabled. I want to know whether I can easily perform hybrid search on my data, combining vector similarity search with keyword matching. I believe other vector databases, like Vespa and Pinecone, offer this natively.

Postgres with pgvector does not offer that natively (you can combine separate lexical and semantic searches, then rerank), but I found a Python library called vecs (see the official docs and GitHub). It provides a client that lets you use Postgres much like Pinecone, but I cannot find a way to do a hybrid search directly with this library. Does anyone know?

"recommendations for software libraries, tutorials, tools, books, or other off-site resources"? It asks a very specific question of "[How do I do] hybrid search on Postgres with pgvector using vecs". Since vecs wraps pgvector, and you can do hybrid search with that, it's a reasonable expectation that it might be possible. I don't think an MRE is a requirement here either. I think it's suitable for SO, and the fact that some version of this question could also be suitable for dba, stats, softwareengineering, cs, datascience and softwarerecs is no reason to close – Commented Oct 21 at 12:17

1 Answer


vecs GitHub issue #88 seems to suggest that it's not supported and won't be, simply because the vendor/maintainer, Supabase, meant it as semantic-only:

vecs is a lib for semantic search. If you're interested in hybrid search I'd suggest following our hybrid search docs, which explain how to create an appropriate SQL function and access it via the REST API

And following that link leads to Supabase's take on pgvector's hybrid search example, which runs two separate searches (one using a full-text search GIN index, the other using an HNSW index), then combines and reranks the results:

WITH semantic_search AS (
    SELECT id, RANK() OVER (ORDER BY embedding <=> %(embedding)s) AS rank
    FROM documents
    ORDER BY embedding <=> %(embedding)s
    LIMIT 20
),
keyword_search AS (
    SELECT id, RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, plainto_tsquery('english', %(query)s) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 20
)
SELECT
    COALESCE(semantic_search.id, keyword_search.id) AS id,
    COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0) +
    COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC
LIMIT 5;
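The scoring in the final SELECT is reciprocal rank fusion (RRF). As an illustrative sketch of the same merge logic in plain Python (the function name `rrf_merge` is mine, and k=60 is just the conventional RRF constant, matching the `%(k)s` parameter above):

```python
def rrf_merge(semantic_ids, keyword_ids, k=60, top_n=5):
    """Reciprocal rank fusion: combine two ranked id lists into one.

    Mirrors the FULL OUTER JOIN + COALESCE scoring in the SQL above:
    each list contributes 1 / (k + rank) for the ids it contains, and
    nothing (i.e. 0) for ids it doesn't.
    """
    scores = {}
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first, truncated to top_n (the SQL's LIMIT 5)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "b" ranks 2nd semantically and 1st lexically, so it wins overall:
print(rrf_merge(["a", "b", "c"], ["b", "d"]))  # ['b', 'a', 'd', 'c']
```

A document that appears in both lists gets two reciprocal-rank contributions, which is why it outranks documents that top only one list.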

Pinecone seems to recommend the same thing:

# Assumes (per Pinecone's docs): pc = Pinecone(api_key=...), with dense_index
# and sparse_index being pc.Index(...) handles over a dense and a sparse index
# respectively, and query being the user's search string.
dense_results = dense_index.search(
    namespace="example-namespace",
    query={"top_k": 40, "inputs": {"text": query}},
)
sparse_results = sparse_index.search(
    namespace="example-namespace",
    query={"top_k": 40, "inputs": {"text": query}},
)

def merge_chunks(h1, h2):
    """Merge the hits from two search results into a single list of
    {'_id', 'chunk_text'} dicts, deduplicated by _id and sorted by _score."""
    # Deduplicate by _id (later hits win on collision)
    deduped_hits = {hit['_id']: hit for hit in h1['result']['hits'] + h2['result']['hits']}.values()
    # Sort by _score descending
    sorted_hits = sorted(deduped_hits, key=lambda x: x['_score'], reverse=True)
    # Transform to the format expected by the reranker
    return [{'_id': hit['_id'], 'chunk_text': hit['fields']['chunk_text']} for hit in sorted_hits]

merged_results = merge_chunks(sparse_results, dense_results)
result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=merged_results,
    rank_fields=["chunk_text"],
    top_n=10,
    return_documents=True,
    parameters={"truncate": "END"},
)
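To see the merge step in isolation, the same dedup-and-sort logic can be exercised on hand-made payloads (dummy data shaped like Pinecone search responses, not real API output):

```python
def merge_chunks(h1, h2):
    """Same merge as above, repeated here so the snippet runs standalone."""
    # Deduplicate by _id (hits from h2 win on collision), then sort by _score
    deduped_hits = {hit['_id']: hit for hit in h1['result']['hits'] + h2['result']['hits']}.values()
    sorted_hits = sorted(deduped_hits, key=lambda x: x['_score'], reverse=True)
    return [{'_id': hit['_id'], 'chunk_text': hit['fields']['chunk_text']} for hit in sorted_hits]

# Dummy payloads; "b" appears in both result sets with different scores.
sparse = {'result': {'hits': [
    {'_id': 'a', '_score': 0.9, 'fields': {'chunk_text': 'alpha'}},
    {'_id': 'b', '_score': 0.5, 'fields': {'chunk_text': 'bravo'}},
]}}
dense = {'result': {'hits': [
    {'_id': 'b', '_score': 0.7, 'fields': {'chunk_text': 'bravo'}},
    {'_id': 'c', '_score': 0.3, 'fields': {'chunk_text': 'charlie'}},
]}}

merged = merge_chunks(sparse, dense)
# "b" is deduplicated; order follows _score: a (0.9), b (0.7), c (0.3)
print([h['_id'] for h in merged])  # ['a', 'b', 'c']
```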

That's what Vespa does as well:

These top-k query operators use index structures to accelerate the query evaluation, avoiding scoring all documents using heuristics. In the context of hybrid text search, the following Vespa top-k query operators are relevant:

  • YQL {targetHits:k}nearestNeighbor() for dense representations (text embeddings) using a configured distance-metric as the scoring function.
  • YQL {targetHits:k}userInput(@user-query) which by default uses weakAnd for sparse representations.

We can combine these operators using boolean query operators like AND/OR/RANK to express a hybrid search query. Then there is a wide range of ways we can combine the various signals in ranking.
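Putting those two operators together, a hybrid Vespa query might look roughly like this YQL sketch (field names `embedding` and the query tensor `q` are placeholders; see Vespa's hybrid-search tutorial for the exact schema setup):

```
select * from sources * where
    ({targetHits:100}nearestNeighbor(embedding, q)) or
    ({targetHits:100}userInput(@query))
```

The `or` retrieves the union of the top-k dense and sparse candidates; a rank profile then decides how their scores are blended.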


2 Comments

Thank you! That is actually helpful, and yes, I agree with you about closing questions unnecessarily. Do they want people to use Stack Overflow or not? LLMs have already hurt Stack Overflow; the only advantage it has left is the community. If the community is aggressive, then all is lost.
It's a good topic for Meta SE. Some of this is probably just the increasing toxicity of the community, but I wouldn't be surprised if some of it were farming rare moderation badges/achievements, or trying to steer users away from SO and towards (their preferred) other SE sites. I've also seen a lot of variability in the sort of crowd different tags tend to attract: newer/more niche/less popular ones tend to be more welcoming. They give you less exposure and reduce the chance you'll attract someone with an answer, but you definitely avoid the generalist know-it-alls on moderation-by-downvoting duty.
