A multi-agent RAG system for role-based access-controlled financial document Q&A. Achieves a 72.7% correctness pass rate on the public FinanceBench benchmark using selective agentic retrieval, a LoRA-fine-tuned reranker, and a self-hosted LLM observability stack.
```mermaid
flowchart TD
Q([Query + JWT]) --> RBAC[rbac_gate<br/>JWT to Qdrant filter]
RBAC --> Guard[guardrails<br/>regex to LLM Guard to LLM classifier]
Guard -->|blocked| Block([blocked])
Guard --> Route{router}
Route -->|simple_lookup| Direct[retrieval → reranker → grader → generator]
Route -->|research_required| Agent[[research_agent subgraph<br/>decompose → retrieve → grade → sufficiency → synthesize<br/>5-turn cap]]
Direct --> Halu[hallucination_checker]
Agent --> Halu
Halu -->|ungrounded, retry up to 2| Direct
Halu --> HITL{hitl_gate}
HITL -->|amount above role threshold| Pause([pause for human approval])
HITL --> Out([Answer + sources])
```
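The routing and grounding-retry logic in the diagram can be sketched as plain control flow. This is a minimal sketch only: `route`, `run_direct`, `run_research_agent`, and `is_grounded` are hypothetical stand-ins passed in as callables, not the repo's API, and the real graph is wired with LangGraph rather than a Python loop.

```python
# Sketch of the routed pipeline with the "retry up to 2" grounding edge.
# All callables are hypothetical stand-ins for the actual LangGraph nodes.

MAX_GROUNDING_RETRIES = 2  # hallucination_checker retry cap from the flowchart


def answer(query: str, route, run_direct, run_research_agent, is_grounded):
    path = route(query)  # "simple_lookup" or "research_required"
    runner = run_direct if path == "simple_lookup" else run_research_agent
    draft = runner(query)
    # Ungrounded drafts are retried on the direct path, at most twice.
    for _ in range(MAX_GROUNDING_RETRIES):
        if is_grounded(draft):
            return draft
        draft = run_direct(query)
    return draft  # last attempt, whether or not it passed the checker
```

After this, the real pipeline still applies the `hitl_gate` (pause for approval above a per-role dollar threshold) before returning the answer with sources.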
A router classifies each query as a simple lookup or research-required. Simple lookups take the fast direct path; research queries enter a multi-turn subgraph that decomposes the question, retrieves per sub-question, grades sufficiency, and synthesizes a final answer. RBAC is enforced at the Qdrant payload-filter level — agentic queries cannot bypass access control. High-stakes answers (above a per-role dollar threshold) pause via LangGraph's interrupt() for human approval, with state checkpointed to Postgres.
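The rbac_gate step above can be sketched as a pure function from verified JWT claims to a Qdrant payload filter. This is a minimal sketch under assumptions: the payload key `allowed_roles` and the claims key `roles` are illustrative schema choices, not necessarily the repo's; the returned dict follows Qdrant's JSON filter shape (`must` / `key` / `match.any`).

```python
# Sketch: verified JWT claims -> Qdrant payload filter.
# "allowed_roles" / "roles" are assumed field names, not the repo's schema.

def rbac_filter(claims: dict) -> dict:
    """Build a Qdrant filter matching only chunks the caller may read.

    A chunk passes if its "allowed_roles" payload array intersects the
    caller's roles (Qdrant "match any" semantics). An empty role list
    therefore matches nothing, which fails closed.
    """
    roles = claims.get("roles", [])
    return {
        "must": [
            {"key": "allowed_roles", "match": {"any": roles}}
        ]
    }
```

Attaching this filter to every vector search, on both the direct path and inside the research subgraph, is what makes access control non-bypassable: the agent simply never retrieves chunks outside its filter.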
- Backend — FastAPI · LangGraph · Qdrant · PostgreSQL · Redis · PyJWT
- Frontend — Next.js 16 · React 19 · Tailwind · shadcn/ui (in progress; Gradio is the current usable UI)
- LLMs — Claude Sonnet 4.6 · gpt-4o-mini · Llama 3.3 (via Groq)
- Retrieval — voyage-finance-2 embeddings · LoRA-fine-tuned BGE-reranker-v2-m3
- Observability — self-hosted LiteLLM proxy + Langfuse v3 + Redis semantic cache
- Safety — Microsoft Presidio PII detection · LLM Guard · LLM classifier (3-layer cascade)
- Evaluation — RAGAS · DeepEval · custom LLM correctness judge
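The Redis semantic cache in the observability layer short-circuits an LLM call when a new query embeds close to a previously answered one. A toy in-memory sketch of the idea (cosine similarity over embeddings; the 0.95 threshold and the in-memory list are illustrative, not the stack's actual LiteLLM/Redis configuration):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Toy semantic cache; in production this lives in Redis behind LiteLLM."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, answer)

    def get(self, query_vec):
        # Return a cached answer if the closest stored query is similar enough.
        best = max(self.entries, key=lambda e: cosine(e[0], query_vec), default=None)
        if best and cosine(best[0], query_vec) >= self.threshold:
            return best[1]
        return None

    def put(self, query_vec, answer: str):
        self.entries.append((query_vec, answer))
```

The threshold is the key tuning knob: too low and paraphrases of different questions collide; too high and the cache never hits.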
Evaluated on the FinanceBench benchmark (150 questions across 32 companies):
| Metric | Value |
|---|---|
| Correctness pass rate | 72.7% (109/150) |
| Refusal rate | 6.7% (10/150) |
| RAGAS faithfulness | 0.747 |
| DeepEval faithfulness | 0.844 |
| DeepEval contextual recall | 0.768 |
Per-slice pass rate: lookup 68.6% (n=86), multi-hop 84.6% (n=13), calc 76.5% (n=51).
The correctness judge pairs Claude Sonnet 4.6 with a structured prompt, calibrated to Cohen's κ = 0.932 against an 89-question hand-labeled set, with an adversarial leniency guard. The evaluation pipeline runs three judges in parallel (RAGAS, DeepEval, and the custom correctness judge), emits per-question diagnostics and a reproducibility-metadata snapshot on every run, and is decision-gated: each candidate intervention must clear an empirically measured noise floor before shipping. Full methodology, per-judge scores, and reproduction commands are in docs/evaluation.md.
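The κ used for judge calibration is observed agreement corrected for chance agreement between the judge's labels and the human labels. A minimal sketch of the standard computation (label values like "pass"/"fail" are illustrative):

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e).

    p_o: fraction of items where judge and human agree.
    p_e: chance agreement expected from each rater's label frequencies.
    """
    assert judge and len(judge) == len(human)
    n = len(judge)
    p_o = sum(a == b for a, b in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    p_e = sum((jc[label] / n) * (hc[label] / n) for label in set(jc) | set(hc))
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

A κ near 0.93, as reported above, indicates near-perfect agreement with the human labels rather than agreement driven by label imbalance.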
- Not deployed to production — runs locally via `docker compose up -d`. No public URL or live traffic.
- Frontend is a vertical slice — login and streaming chat work; the sidebar, HITL UI, admin panel, and citation PDF viewer are unbuilt.
- Below the top published FinanceBench result (Mafin, ~99%), though above the FinanceBench paper's baselines (38–43%) and FinGEAR (EMNLP 2025, GraphRAG, ~55%).
```bash
git clone https://github.com/Rishabhmannu/financebench-rag-agent.git
cd financebench-rag-agent
pip install -e ".[dev]" && cp .env.example .env   # add your API keys
docker compose up -d && make run                  # API at http://localhost:8000
```

Full setup, test accounts, dev commands, and API surface are in docs/setup.md.
- docs/evaluation.md — Methodology, results, reproduction
- docs/engineering-log.md — Engineering decisions and tradeoffs
- docs/setup.md — Local development, test accounts, environment
- docs/architecture.md · docs/api-reference.md · docs/rbac-matrix.md · web/README.md
MIT