add rag and examples #53

Open
longXboy wants to merge 5 commits into main from feat/rag

Conversation

@longXboy
Member

No description provided.

Copilot AI review requested due to automatic review settings October 16, 2025 07:22
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive RAG (Retrieval-Augmented Generation) implementation with document chunking, BM25-based retrieval, reranking strategies, and a complete example pipeline demonstrating integration with the flow graph system.

Key changes:

  • Adds core RAG types and interfaces (Document, Indexer, Retriever, Reranker) to the root package
  • Implements BM25 scoring algorithm for text-based document retrieval
  • Provides in-memory document stores with both basic and vector-ready implementations
  • Includes text chunking strategies (fixed-size and sentence-based) with Unicode support
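For readers unfamiliar with BM25, the scoring step listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `bm25Scores`, the whitespace tokenizer, and the parameter values `k1 = 1.2` and `b = 0.75` (commonly used defaults) are assumptions. Only the IDF expression matches the one quoted from `rag/retrieval/bm25.go` later in this conversation.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Assumed BM25 parameters; the PR may use different values.
const (
	k1 = 1.2
	b  = 0.75
)

// tokenize is a simplified stand-in for the PR's tokenizer.
func tokenize(s string) []string { return strings.Fields(strings.ToLower(s)) }

// bm25Scores returns one BM25 score per document for the given query.
func bm25Scores(query string, docs []string) []float64 {
	docFreq := map[string]int{}
	tokens := make([][]string, len(docs))
	total := 0
	// First pass: tokenize and count document frequencies.
	for i, d := range docs {
		tokens[i] = tokenize(d)
		total += len(tokens[i])
		seen := map[string]bool{}
		for _, t := range tokens[i] {
			if !seen[t] {
				seen[t] = true
				docFreq[t]++
			}
		}
	}
	scores := make([]float64, len(docs))
	if len(docs) == 0 || total == 0 {
		return scores
	}
	avgLen := float64(total) / float64(len(docs))
	n := float64(len(docs))
	// Second pass: accumulate the per-term contributions.
	for i, toks := range tokens {
		tf := map[string]int{}
		for _, t := range toks {
			tf[t]++
		}
		for _, q := range tokenize(query) {
			f := float64(tf[q])
			if f == 0 {
				continue
			}
			df := float64(docFreq[q])
			idf := math.Log((n-df+0.5)/(df+0.5) + 1.0)
			norm := f + k1*(1-b+b*float64(len(toks))/avgLen)
			scores[i] += idf * f * (k1 + 1) / norm
		}
	}
	return scores
}

func main() {
	docs := []string{"go is fun", "bm25 ranks text"}
	fmt.Println(bm25Scores("go", docs))
}
```

Only the first document contains the query term, so the second document scores zero.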

Reviewed Changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| rag.go | Core RAG type definitions and option functions |
| rag/rag.go | Type aliases exporting core RAG interfaces from root package |
| rag/README.md | Documentation describing RAG components and usage patterns |
| rag/chunking/chunking.go | Text chunking implementations with Unicode-aware splitting |
| rag/chunking/chunking_test.go | Tests for chunking strategies including Unicode handling |
| rag/retrieval/bm25.go | BM25 scoring algorithm implementation |
| rag/retrieval/bm25_test.go | Tests for BM25 scorer functionality |
| rag/retrieval/util.go | Text tokenization utility function |
| rag/retrieval/reranker.go | Document reranking strategies (cross-encoder, LLM, RRF) |
| rag/store/memory.go | In-memory document store with BM25 retrieval |
| rag/store/memory_test.go | Tests for in-memory store operations |
| rag/store/vector.go | Vector store implementation with BM25 fallback |
| rag/store/util.go | Metadata filtering utility |
| examples/rag/main.go | Complete RAG pipeline example using flow graph |
| examples/rag/nodes.go | Node implementations for RAG pipeline stages |
| examples/go.mod | Dependency version update |

Comment thread rag/retrieval/bm25.go Outdated
// Second pass: compute the IDF values.
for term, df := range s.docFreq {
	// IDF = log((N - df + 0.5) / (df + 0.5) + 1)
	s.idf[term] = math.Log((float64(s.docCount)-float64(df)+0.5)/(float64(df)+0.5) + 1.0)

Copilot AI Oct 16, 2025

The idf map is written during Index() and read during Score() without synchronization. If Index() is called concurrently with Score(), this creates a race condition. Consider protecting idf, docFreq, docLens, avgDocLen, and docCount with a mutex, or document that Index() must not be called concurrently with Score().
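One way to follow the mutex suggestion is sketched below. It is only an illustration: the `scorer` type, its whitespace tokenization, and the `IDF` accessor are hypothetical stand-ins, not the PR's actual BM25 scorer. The idea is to build the new statistics outside the lock and swap them in under a write lock, while readers take a read lock.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"sync"
)

// scorer is a hypothetical, simplified stand-in for the PR's BM25 scorer.
type scorer struct {
	mu       sync.RWMutex
	docCount int
	docFreq  map[string]int
	idf      map[string]float64
}

// Index computes fresh statistics without holding the lock, then swaps
// them in under the write lock so readers never see a half-built map.
func (s *scorer) Index(docs []string) {
	docFreq := make(map[string]int)
	for _, d := range docs {
		seen := make(map[string]bool)
		for _, t := range strings.Fields(strings.ToLower(d)) {
			if !seen[t] {
				seen[t] = true
				docFreq[t]++
			}
		}
	}
	idf := make(map[string]float64, len(docFreq))
	n := float64(len(docs))
	for term, df := range docFreq {
		idf[term] = math.Log((n-float64(df)+0.5)/(float64(df)+0.5) + 1.0)
	}

	s.mu.Lock()
	s.docCount, s.docFreq, s.idf = len(docs), docFreq, idf
	s.mu.Unlock()
}

// IDF reads under the read lock, so it is safe to call concurrently
// with Index. Unknown terms return 0.
func (s *scorer) IDF(term string) float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.idf[term]
}

func main() {
	s := &scorer{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			s.Index([]string{"go is fun", "bm25 ranks text"})
		}()
		go func() {
			defer wg.Done()
			_ = s.IDF("go")
		}()
	}
	wg.Wait()
	fmt.Printf("%.4f\n", s.IDF("go")) // prints 0.6931
}
```

Running a test of this shape with `go test -race` is a quick way to check that no unsynchronized access remains.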

Comment thread rag/store/memory.go Outdated
Comment on lines +47 to +52
// Rebuild the BM25 index.
allDocs := make([]rag.Document, 0, len(s.docs))
for _, doc := range s.docs {
	allDocs = append(allDocs, doc)
}
s.bm25.Index(allDocs)

Copilot AI Oct 16, 2025

This index rebuilding logic is duplicated in both Add() and Delete() methods. Consider extracting it into a private helper method like rebuildIndex() to reduce duplication and improve maintainability.
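The extraction suggested here could look roughly like the sketch below. The `memoryStore`, `indexer`, and `Document` types are simplified, hypothetical stand-ins for the PR's store, BM25 scorer, and `rag.Document`. Only the body of `rebuildIndex` mirrors the quoted snippet.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Document is a simplified stand-in for rag.Document.
type Document struct {
	ID      string
	Content string
}

// indexer is a hypothetical stand-in for the BM25 scorer; it only
// records which document IDs it was asked to index.
type indexer struct{ indexed []string }

func (ix *indexer) Index(docs []Document) {
	ix.indexed = ix.indexed[:0]
	for _, d := range docs {
		ix.indexed = append(ix.indexed, d.ID)
	}
	sort.Strings(ix.indexed)
}

type memoryStore struct {
	mu   sync.Mutex
	docs map[string]Document
	bm25 *indexer
}

func newMemoryStore() *memoryStore {
	return &memoryStore{docs: make(map[string]Document), bm25: &indexer{}}
}

// rebuildIndex re-indexes every stored document.
// The caller must hold s.mu.
func (s *memoryStore) rebuildIndex() {
	allDocs := make([]Document, 0, len(s.docs))
	for _, doc := range s.docs {
		allDocs = append(allDocs, doc)
	}
	s.bm25.Index(allDocs)
}

func (s *memoryStore) Add(docs ...Document) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, d := range docs {
		s.docs[d.ID] = d
	}
	s.rebuildIndex()
}

func (s *memoryStore) Delete(ids ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, id := range ids {
		delete(s.docs, id)
	}
	s.rebuildIndex()
}

func main() {
	s := newMemoryStore()
	s.Add(Document{ID: "a", Content: "alpha"}, Document{ID: "b", Content: "beta"})
	s.Delete("a")
	fmt.Println(s.bm25.indexed) // prints [b]
}
```

Documenting the locking contract on the helper ("caller must hold s.mu") keeps `Add` and `Delete` from each having to re-derive it.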

Comment thread rag/store/vector.go Outdated
Comment on lines +48 to +53
// Rebuild the BM25 index.
allDocs := make([]rag.Document, 0, len(s.docs))
for _, doc := range s.docs {
	allDocs = append(allDocs, doc)
}
s.bm25.Index(allDocs)

Copilot AI Oct 16, 2025

This index rebuilding logic is duplicated in both Add() and Delete() methods. Consider extracting it into a private helper method like rebuildIndex() to reduce duplication and improve maintainability.

Comment thread rag/retrieval/reranker.go Outdated
for _, results := range resultLists {
	for rank, doc := range results {
		// RRF formula: score = 1 / (k + rank)
		rrfScore := 1.0 / float64(r.k+rank+1)

Copilot AI Oct 16, 2025

The RRF formula should use rank (0-based index), not rank+1. The standard RRF formula is 1/(k + rank) where rank starts at 0. Adding 1 shifts all rankings and produces incorrect scores.

Suggested change
rrfScore := 1.0 / float64(r.k+rank+1)
rrfScore := 1.0 / float64(r.k+rank)
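To make the effect of the rank convention concrete, here is a small sketch of reciprocal rank fusion in which the offset added to the 0-based loop index is a parameter. The names `fuseRRF`, `rankedIDs`, and `rankOffset` are hypothetical and not taken from the PR. Note that RRF as originally described counts ranks from 1, which with a 0-based loop index corresponds to `rankOffset = 1`; an offset of 0 gives the variant proposed in the suggestion above.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF sums 1/(k + rank) over all result lists, where rank is the
// 0-based position in a list plus rankOffset.
func fuseRRF(resultLists [][]string, k, rankOffset int) map[string]float64 {
	scores := make(map[string]float64)
	for _, results := range resultLists {
		for i, id := range results {
			scores[id] += 1.0 / float64(k+i+rankOffset)
		}
	}
	return scores
}

// rankedIDs returns IDs sorted by descending score, ties broken by ID.
func rankedIDs(scores map[string]float64) []string {
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	lists := [][]string{{"a", "b", "c"}, {"b", "c", "a"}}
	for _, offset := range []int{0, 1} {
		scores := fuseRRF(lists, 60, offset)
		fmt.Println(offset, rankedIDs(scores), scores["b"])
	}
}
```

In this small example both offsets produce the same ordering (`b`, `a`, `c`) and only the absolute scores differ. That is not guaranteed in general, so the convention is worth stating in the code comment either way.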

Comment thread rag/chunking/chunking.go Outdated
}

// Make sure we advance at least to the next meaningful position.
if nextStart <= start {

Copilot AI Oct 16, 2025

When nextStart is set to 0 after being negative, and start is already 0, the condition on line 82 will set nextStart = end, causing the overlap to be ignored. This creates chunks without the intended overlap. The logic should ensure forward progress while respecting overlap.

Suggested change
if nextStart <= start {
if nextStart <= start && start != 0 {
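An alternative way to guarantee forward progress while still honoring the overlap is to clamp the overlap once, up front, so the step between chunk starts is always at least one rune. The sketch below illustrates that idea only. `chunkRunes` and its parameters are hypothetical, and it does not reproduce the PR's chunker, which may also split on sentence boundaries.

```go
package main

import "fmt"

// chunkRunes splits text into chunks of at most size runes, where each
// chunk starts overlap runes before the end of the previous one.
// Working on runes keeps multi-byte characters intact.
func chunkRunes(text string, size, overlap int) []string {
	runes := []rune(text)
	if size <= 0 || len(runes) == 0 {
		return nil
	}
	if overlap < 0 {
		overlap = 0
	}
	if overlap >= size {
		overlap = size - 1 // keep the step positive
	}
	step := size - overlap

	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last chunk reached the end of the text
		}
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", chunkRunes("abcdefghij", 4, 2))
	// prints ["abcd" "cdef" "efgh" "ghij"]
}
```

Because the step is fixed before the loop, no per-iteration check of `nextStart` against `start` is needed, and the first chunk keeps its overlap with the second.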

Copilot AI review requested due to automatic review settings October 20, 2025 06:02
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.


Comment thread rag/rag.go Outdated
Retrieve(ctx context.Context, query string, opts ...RetrieveOption) ([]Document, error)
}

// Reranker 接口负责对初检索结果进行重排序,提升相关性。

Copilot AI Oct 20, 2025

Missing space after comma in Chinese comment. Should be '对初检索结果进行重排序, 提升相关性。' (space after comma).

Suggested change
// Reranker 接口负责对初检索结果进行重排序,提升相关性。
// Reranker 接口负责对初检索结果进行重排序, 提升相关性。
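The `Retrieve` signature quoted at the top of this thread takes variadic `RetrieveOption` values. A common way to implement such options in Go is the functional-options pattern, sketched below. Everything apart from the `Retrieve` method signature is an assumption made for illustration: the `retrieveConfig` struct, the `WithTopK` option, the default of 2, the `staticRetriever` type, and the `retrieveAll` helper are hypothetical and may not match the PR.

```go
package main

import (
	"context"
	"fmt"
)

// Document is a simplified stand-in for the PR's Document type.
type Document struct {
	ID      string
	Content string
}

// retrieveConfig and WithTopK are hypothetical; the PR's options may differ.
type retrieveConfig struct{ topK int }

type RetrieveOption func(*retrieveConfig)

func WithTopK(n int) RetrieveOption {
	return func(c *retrieveConfig) { c.topK = n }
}

// Retriever matches the method signature quoted in the thread.
type Retriever interface {
	Retrieve(ctx context.Context, query string, opts ...RetrieveOption) ([]Document, error)
}

// staticRetriever ignores the query and returns its first topK documents.
type staticRetriever struct{ docs []Document }

func (r staticRetriever) Retrieve(ctx context.Context, query string, opts ...RetrieveOption) ([]Document, error) {
	cfg := retrieveConfig{topK: 2} // assumed default
	for _, opt := range opts {
		opt(&cfg)
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	n := cfg.topK
	if n > len(r.docs) {
		n = len(r.docs)
	}
	if n < 0 {
		n = 0
	}
	return r.docs[:n], nil
}

// retrieveAll is a small helper that uses a background context.
func retrieveAll(r Retriever, query string, opts ...RetrieveOption) []Document {
	docs, err := r.Retrieve(context.Background(), query, opts...)
	if err != nil {
		return nil
	}
	return docs
}

func main() {
	var r Retriever = staticRetriever{docs: []Document{{ID: "1"}, {ID: "2"}, {ID: "3"}}}
	fmt.Println(len(retrieveAll(r, "anything")))              // default top-k: prints 2
	fmt.Println(len(retrieveAll(r, "anything", WithTopK(1)))) // overridden: prints 1
}
```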

Comment thread rag.go Outdated
// The Retriever interface retrieves relevant documents for a request.
type Retriever = rag.Retriever

// Reranker 接口负责对初检索结果进行重排序,提升相关性。

Copilot AI Oct 20, 2025

Missing space after comma in Chinese comment. Should be '对初检索结果进行重排序, 提升相关性。' (space after comma).

Suggested change
// Reranker 接口负责对初检索结果进行重排序,提升相关性。
// Reranker 接口负责对初检索结果进行重排序, 提升相关性。

Copilot AI review requested due to automatic review settings October 20, 2025 09:25
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.


Comment thread template.go
if i >= len(clone.tmpls) {
	break
}
clone.tmpls[i].vars = data

Copilot AI Oct 20, 2025

While Clone() creates a deep copy of template pointers, the vars field (type any) is assigned directly without being copied. If vars contains mutable data structures (maps, slices, pointers), concurrent modifications across goroutines could still cause race conditions. Consider documenting this limitation or implementing deep copying for vars when it contains known mutable types.
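If deep copying is the chosen route, one possible shape is a recursive copy that handles the mutable container types most likely to appear in template variables. The sketch below assumes `vars` typically holds `map[string]any` and `[]any` values. The helper name `copyVars` is hypothetical, and any other type (pointers, structs, typed maps) is returned as-is and would remain shared.

```go
package main

import "fmt"

// copyVars returns a deep copy of v for the container types it knows
// (map[string]any and []any). All other values are returned unchanged,
// so pointers and custom types are still shared with the original.
func copyVars(v any) any {
	switch x := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(x))
		for k, val := range x {
			out[k] = copyVars(val)
		}
		return out
	case []any:
		out := make([]any, len(x))
		for i, val := range x {
			out[i] = copyVars(val)
		}
		return out
	default:
		return v
	}
}

func main() {
	orig := map[string]any{"name": "a", "tags": []any{"x"}}
	cp := copyVars(orig).(map[string]any)

	// Mutating the copy leaves the original untouched.
	cp["name"] = "b"
	cp["tags"].([]any)[0] = "y"

	fmt.Println(orig["name"], orig["tags"]) // prints a [x]
}
```

Under these assumptions the assignment would become something like `clone.tmpls[i].vars = copyVars(data)`. Documenting the limitation for unhandled types would still be needed.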

2 participants