Judgeval is an open-source Python SDK for agent improvement. It provides tracing and agent-judge evaluation for LLM-powered applications — so you can detect failures, understand what went wrong, and validate fixes against real production cases before shipping.
To get started, dive into the docs.
- **OpenTelemetry-based tracing** -- Instrument any function with `@Tracer.observe()`. Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.
- **Agent judges** -- Define prompt-based scorers to evaluate agent behaviors at scale. Judges produce structured behaviors (scored, labeled outputs that describe how your agent acted) which accumulate into a searchable record of agent behavior over time. Run judges against live production traffic, or replay them on historical traces to validate fixes before shipping; see the sketch after this list.
- **Online monitoring** -- Automatically score live production traffic server-side with no latency impact. Detected behaviors surface as structured signals; configure Slack alerts so regressions and recurrences never go unnoticed.
- **Broad integrations** -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.
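To make the judge workflow concrete, here is a minimal sketch of a prompt-based scorer. The class name `PromptScorer`, its import path, and its arguments are assumptions for illustration only, not the confirmed API; see the docs for the real interface.

```python
# Hypothetical sketch -- the actual class name and arguments may differ.
from judgeval.scorers import PromptScorer  # assumed import path

helpfulness_judge = PromptScorer(
    name="helpfulness",
    # The judge prompt describes the behavior to detect; the scorer
    # emits a scored, labeled behavior for each evaluated trace.
    prompt=(
        "Given the user's question and the agent's final answer, decide "
        "whether the answer directly addresses the question. Label the "
        "behavior 'helpful' or 'unhelpful' and justify briefly."
    ),
)
```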
Install the SDK:
```bash
pip install judgeval
```

Set your credentials:

```bash
export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...
```

Add observability to your agent with two lines of setup:
```python
from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())  # wrapped client: every LLM call is traced automatically

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    # vector_db is a placeholder for your own vector store client
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")
```

Supports OpenAI, Anthropic, Google GenAI, Together AI, LangGraph, OpenLit, and Claude Agent SDK. See the full integrations docs.
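The same `wrap()` call covers the other auto-instrumented providers. A minimal sketch, assuming the `anthropic` package is installed and that `wrap` accepts an Anthropic client as the integrations list above suggests:

```python
from anthropic import Anthropic
from judgeval import wrap

# Wrapping the client captures inputs, outputs, and token usage
# for each completion call, just as with the OpenAI client above.
anthropic_client = wrap(Anthropic())
```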
Manage agents, traces, judges, behaviors, and evaluations from the terminal. Query trace history, deploy judges, inspect detected behaviors, and run evals against production data — all without leaving your shell. See the CLI repo and docs.
Connect Judgment to any MCP-compatible AI tool. Query agent traces, invoke judges, browse detected behaviors, and surface failures directly inside your AI assistant or IDE. See the docs.
Judgeval is created and maintained by Judgment Labs.