Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a comprehensive RAG (Retrieval-Augmented Generation) implementation with document chunking, BM25-based retrieval, reranking strategies, and a complete example pipeline demonstrating integration with the flow graph system.
Key changes:
- Adds core RAG types and interfaces (
Document,Indexer,Retriever,Reranker) to the root package - Implements BM25 scoring algorithm for text-based document retrieval
- Provides in-memory document stores with both basic and vector-ready implementations
- Includes text chunking strategies (fixed-size and sentence-based) with Unicode support
Reviewed Changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| rag.go | Core RAG type definitions and option functions |
| rag/rag.go | Type aliases exporting core RAG interfaces from root package |
| rag/README.md | Documentation describing RAG components and usage patterns |
| rag/chunking/chunking.go | Text chunking implementations with Unicode-aware splitting |
| rag/chunking/chunking_test.go | Tests for chunking strategies including Unicode handling |
| rag/retrieval/bm25.go | BM25 scoring algorithm implementation |
| rag/retrieval/bm25_test.go | Tests for BM25 scorer functionality |
| rag/retrieval/util.go | Text tokenization utility function |
| rag/retrieval/reranker.go | Document reranking strategies (cross-encoder, LLM, RRF) |
| rag/store/memory.go | In-memory document store with BM25 retrieval |
| rag/store/memory_test.go | Tests for in-memory store operations |
| rag/store/vector.go | Vector store implementation with BM25 fallback |
| rag/store/util.go | Metadata filtering utility |
| examples/rag/main.go | Complete RAG pipeline example using flow graph |
| examples/rag/nodes.go | Node implementations for RAG pipeline stages |
| examples/go.mod | Dependency version update |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| // 第二遍:计算 IDF 值 | ||
| for term, df := range s.docFreq { | ||
| // IDF = log((N - df + 0.5) / (df + 0.5) + 1) | ||
| s.idf[term] = math.Log((float64(s.docCount)-float64(df)+0.5)/(float64(df)+0.5) + 1.0) |
There was a problem hiding this comment.
The idf map is written during Index() and read during Score() without synchronization. If Index() is called concurrently with Score(), this creates a race condition. Consider protecting idf, docFreq, docLens, avgDocLen, and docCount with a mutex, or document that Index() must not be called concurrently with Score().
| // 重建 BM25 索引 | ||
| allDocs := make([]rag.Document, 0, len(s.docs)) | ||
| for _, doc := range s.docs { | ||
| allDocs = append(allDocs, doc) | ||
| } | ||
| s.bm25.Index(allDocs) |
There was a problem hiding this comment.
This index rebuilding logic is duplicated in both Add() and Delete() methods. Consider extracting it into a private helper method like rebuildIndex() to reduce duplication and improve maintainability.
| // 重建 BM25 索引 | ||
| allDocs := make([]rag.Document, 0, len(s.docs)) | ||
| for _, doc := range s.docs { | ||
| allDocs = append(allDocs, doc) | ||
| } | ||
| s.bm25.Index(allDocs) |
There was a problem hiding this comment.
This index rebuilding logic is duplicated in both Add() and Delete() methods. Consider extracting it into a private helper method like rebuildIndex() to reduce duplication and improve maintainability.
| for _, results := range resultLists { | ||
| for rank, doc := range results { | ||
| // RRF 公式: score = 1 / (k + rank) | ||
| rrfScore := 1.0 / float64(r.k+rank+1) |
There was a problem hiding this comment.
The RRF formula should use rank (0-based index), not rank+1. The standard RRF formula is 1/(k + rank) where rank starts at 0. Adding 1 shifts all rankings and produces incorrect scores.
| rrfScore := 1.0 / float64(r.k+rank+1) | |
| rrfScore := 1.0 / float64(r.k+rank) |
| } | ||
|
|
||
| // 确保至少前进到下一个有意义的位置 | ||
| if nextStart <= start { |
There was a problem hiding this comment.
When nextStart is set to 0 after being negative, and start is already 0, the condition on line 82 will set nextStart = end, causing the overlap to be ignored. This creates chunks without the intended overlap. The logic should ensure forward progress while respecting overlap.
| if nextStart <= start { | |
| if nextStart <= start && start != 0 { |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| Retrieve(ctx context.Context, query string, opts ...RetrieveOption) ([]Document, error) | ||
| } | ||
|
|
||
| // Reranker 接口负责对初检索结果进行重排序,提升相关性。 |
There was a problem hiding this comment.
Missing space after comma in Chinese comment. Should be '对初检索结果进行重排序, 提升相关性。' (space after comma).
| // Reranker 接口负责对初检索结果进行重排序,提升相关性。 | |
| // Reranker 接口负责对初检索结果进行重排序, 提升相关性。 |
| // Retriever 接口负责根据请求检索相关文档。 | ||
| type Retriever = rag.Retriever | ||
|
|
||
| // Reranker 接口负责对初检索结果进行重排序,提升相关性。 |
There was a problem hiding this comment.
Missing space after comma in Chinese comment. Should be '对初检索结果进行重排序, 提升相关性。' (space after comma).
| // Reranker 接口负责对初检索结果进行重排序,提升相关性。 | |
| // Reranker 接口负责对初检索结果进行重排序, 提升相关性。 |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| if i >= len(clone.tmpls) { | ||
| break | ||
| } | ||
| clone.tmpls[i].vars = data |
There was a problem hiding this comment.
While Clone() creates a deep copy of template pointers, the vars field (type any) is assigned directly without being copied. If vars contains mutable data structures (maps, slices, pointers), concurrent modifications across goroutines could still cause race conditions. Consider documenting this limitation or implementing deep copying for vars when it contains known mutable types.
No description provided.