End-to-End LLM Evaluation
End-to-end evaluation assesses the observable inputs and outputs of your LLM application and treats it as a black box: you only care about what goes in and what comes out, not the path the system took to get there. The shape of "input" and "output" depends entirely on what your app does (a minimal sketch follows this list):
- Tool-using agent treated as a black box: input is the user's task, output is the final answer plus the tools that were called.
- Multi-turn chatbot / support agent: input is the scenario the user is in, output is the full conversation.
- RAG / QA app: input is a question, output is the answer (and the retrieved context, if you want to score faithfulness).
- Document summarization: input is the source document, output is the summary.
- Classifier / extractor: input is a chunk of text, output is the label or the structured fields you pulled out.
- Writing assistant / rewriter: input is the draft (and any instructions), output is the rewritten text.
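As a tiny illustration, here's how two of these shapes map onto plain function calls. `summarize` and `run_agent` are hypothetical stand-ins for your own app, not deepeval APIs:

```python
# Hypothetical stand-ins for your own application -- the only thing that
# matters for end-to-end evaluation is what goes in and what comes out.

def summarize(document: str) -> str:
    # Your real pipeline (chunking, LLM calls, ...) would live here.
    return "One-paragraph summary of the document."

def run_agent(task: str) -> tuple[str, list[str]]:
    # Your real agent loop would live here.
    return "Your flight is booked.", ["search_flights", "book_flight"]

# Summarization: input = source document, output = summary.
summary = summarize("Q3 revenue grew 12% year over year, driven by ...")

# Agent as a black box: input = the user's task,
# output = the final answer plus the tools that were called.
answer, tools_called = run_agent("Book me a flight to Berlin next Tuesday")
```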

This page explains the concepts behind end-to-end evaluation. For the actual step-by-step walkthroughs, jump to the right flavor for your application:
- Single-Turn End-to-End Evals: for any LLM app where one input maps to one output (agents treated as a black box, RAG / QA, summarization, classifiers, etc.).
- Multi-Turn End-to-End Evals: for chatbots and conversational agents where the unit of evaluation is the whole conversation.
Treating Your App as a Black Box
In end-to-end evaluation, you only describe what's observable from outside your LLM application: the input you sent, the output that came back, and any context that was used along the way. You do not describe the retrieval algorithm, the chain of LLM calls inside an agent, or any internal reasoning steps. That's the whole point of "end-to-end": you're grading the result, not the path the system took to get there.
Concretely, the parameters you populate on a test case are the entire surface your metrics see.
For single-turn apps, you populate fields on an LLMTestCase:
- `input`: what you sent into your app (the question, document, draft, task, etc.).
- `actual_output`: what your app produced (the answer, summary, label, rewritten text, agent's final reply).
- `retrieval_context`: for RAG-style apps, the chunks your retriever returned. Required by metrics like `FaithfulnessMetric` and `ContextualRelevancyMetric`.
- `tools_called`: for agentic apps, the tools the agent invoked. Required by metrics like `ToolCorrectnessMetric` and `ArgumentCorrectnessMetric`.
- `expected_output` / `expected_tools`: optional gold references, used by reference-based metrics.
- `context`: optional extra background, used by some reference-based metrics.
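For example, a test case for a RAG-style app that also calls tools might look like this. The values are illustrative; only the field names come from the list above:

```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What is your refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    # For RAG-style apps: the chunks your retriever returned.
    retrieval_context=[
        "Refunds are accepted within 30 days of the purchase date."
    ],
    # For agentic apps: which tools the agent invoked.
    tools_called=[ToolCall(name="search_policy_docs")],
    # Optional gold reference for reference-based metrics.
    expected_output="Refunds are available for 30 days after purchase.",
)
```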
For multi-turn apps, you populate fields on a ConversationalTestCase:
- `scenario`: what the simulated user is trying to do.
- `expected_outcome`: what success looks like.
- `user_description`: who the user is (persona, role, constraints).
- `turns`: the sequence of `Turn(role, content)` objects that make up the conversation.
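A minimal sketch, assuming a recent deepeval version where `ConversationalTestCase` exposes these fields directly (values are illustrative):

```python
from deepeval.test_case import ConversationalTestCase, Turn

test_case = ConversationalTestCase(
    scenario="User was double-billed and wants their subscription cancelled.",
    expected_outcome="The plan is cancelled and a refund is promised.",
    user_description="A frustrated long-time customer on the annual plan.",
    turns=[
        Turn(role="user", content="I was charged twice this month. Cancel my plan."),
        Turn(
            role="assistant",
            content="Sorry about that! I've cancelled your plan and flagged "
            "the duplicate charge for a refund.",
        ),
    ],
)
```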
Notice what's not there: there's no place to describe "the retriever's prompt", "the tool argument schema", or "the inner LLM call that produced this answer." If a metric needs to score one of those things in isolation, end-to-end isn't the right fit.
Single-Turn vs Multi-Turn
Pick the flavor that matches your application:
| | Single-Turn | Multi-Turn |
|---|---|---|
| Test case | LLMTestCase | ConversationalTestCase |
| Dataset entry | Golden | ConversationalGolden |
| What's evaluated | One input → one output | A full conversation (a sequence of Turns) |
| How test cases are made | You invoke your app on each golden and build the test case from the result | The ConversationSimulator drives a synthetic user against your chatbot until the scenario plays out |
| Typical apps | Agents-as-black-box, RAG / QA, summarization, classifiers, writing assistants | Chatbots, support agents, multi-turn assistants |
| Metric base class | BaseMetric | BaseConversationalMetric |
| Walkthrough | Single-Turn E2E Evals → | Multi-Turn E2E Evals → |
The two flavors live on different test case classes because the unit of evaluation is genuinely different (one exchange vs many), and deepeval will refuse to mix them in the same test run.
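In dataset form, the distinction looks roughly like this. A sketch, assuming a deepeval version that ships `ConversationalGolden` in `deepeval.dataset`; values are illustrative:

```python
from deepeval.dataset import EvaluationDataset, Golden, ConversationalGolden

# Single-turn: one Golden per input your app will be invoked on.
single_turn = EvaluationDataset(goldens=[
    Golden(input="What is your refund window?"),
])

# Multi-turn: one ConversationalGolden per scenario for the simulator to play out.
multi_turn = EvaluationDataset(goldens=[
    ConversationalGolden(
        scenario="User was double-billed and wants their subscription cancelled.",
        expected_outcome="The plan is cancelled and a refund is promised.",
        user_description="A frustrated long-time customer on the annual plan.",
    ),
])
```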
End-to-End vs Component-Level
End-to-end and component-level evaluation are not two separate workflows β they're the same workflow at different granularities. End-to-end evaluation is just component-level evaluation where the entire system is treated as one component with no internal steps. That's the only real difference.
In both cases you're attaching metrics to a unit of work and scoring the input/output of that unit:
- End-to-end: the unit is the whole app. One test case per run of your app, scoring the final input → final output.
- Component-level: the unit is each `@observe`'d span. Many test cases per run of your app (one per span you've chosen to grade), each scoring the input → output of that span.
| | End-to-End | Component-Level |
|---|---|---|
| What you score | The final user-visible output (the system as one black-box component) | Individual internal spans (retriever, tool call, sub-agent, etc.) |
| How metrics are attached | To the test case (or to the trace as a whole) | To @observe(metrics=[...]) on each span |
| Best for | Anything with a "flat" architecture, or where you only care about the result | Complex agents, multi-step pipelines, anywhere different components need different metrics |
| Multi-turn supported | Yes | Single-turn only today |
You don't have to choose just one. In fact, when you use the recommended `evals_iterator()` path, end-to-end and component-level run in the same loop: the metrics you pass to `evals_iterator(metrics=[...])` are scored end-to-end, while any metrics you've attached to `@observe(metrics=[...])` on individual spans are scored component-level. Many real applications run both, with end-to-end on the final answer and component-level on a few critical spans.
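Here's a sketch of that combined loop. The retriever and generator bodies are stand-ins for your own code, and the sketch assumes the current tracing API, where `update_current_span(test_case=...)` hands deepeval the test case to grade for a span:

```python
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=[FaithfulnessMetric()])  # component-level: grades this span
def generate(question: str, chunks: list[str]) -> str:
    answer = "You can request a refund within 30 days."  # stand-in LLM call
    update_current_span(test_case=LLMTestCase(
        input=question, actual_output=answer, retrieval_context=chunks
    ))
    return answer

@observe()  # root span: the whole app
def app(question: str) -> str:
    chunks = ["Refunds are accepted within 30 days of purchase."]  # stand-in retriever
    return generate(question, chunks)

dataset = EvaluationDataset(goldens=[Golden(input="What is your refund window?")])

# Metrics passed here are scored end-to-end on the final input/output;
# metrics attached via @observe above are scored component-level, per span.
for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    app(golden.input)
```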
When should you choose end-to-end?
Choose end-to-end evaluation when:
- Your LLM application has a "flat" architecture that fits naturally into a single `LLMTestCase` (agents treated as a black box, RAG / QA, summarization, single-shot classifiers, writing assistants, etc.).
- Your application is multi-turn (chatbots, support agents) and you want to score the whole conversation rather than each step.
- Your application is a complex agent, but you've concluded that component-level evaluation gives you too much noise and you'd rather grade the final outcome.
In short: you care about the result, not the path the system took to get there. Most of the quickstart is end-to-end evaluation.
Two Ways to Run a Test Run
Single-turn evaluation gives you a choice between two equivalent code paths; multi-turn currently supports only the `evaluate()` path:
| Approach | What it looks like | When to choose it |
|---|---|---|
| `evaluate(test_cases=...)` | Build a list of `LLMTestCase`s (or `ConversationalTestCase`s) up front and hand them to a single `evaluate()` call. | You want a self-contained script with no tracing dependency. |
| `dataset.evals_iterator()` with `@observe` (recommended; single-turn only) | Decorate your app with `@observe`, loop over goldens with `evals_iterator(metrics=[...])`. deepeval builds the test cases from the captured trace. | Your app is (or will be) instrumented with tracing. You also get a full per-test-case trace view on Confident AI for free. |
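The first path in full, with a hypothetical `my_app` stub standing in for your application:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_app(question: str) -> str:
    # Stand-in for your real application.
    return "Refunds are available for 30 days after purchase."

# Build the test cases up front by invoking your app yourself...
questions = ["What is your refund window?", "Do you ship internationally?"]
test_cases = [LLMTestCase(input=q, actual_output=my_app(q)) for q in questions]

# ...then hand them all to a single evaluate() call.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```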
For new single-turn projects we recommend `evals_iterator()`: same amount of code, plus traces, plus the same setup carries over to component-level evaluation later.
Multi-turn end-to-end evaluation only uses `evaluate()` today; the `evals_iterator()` form is single-turn only.
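For multi-turn, that means handing `ConversationalTestCase`s to `evaluate()` with a conversational metric. `ConversationalGEval` is used here only as an example, and its exact signature may vary by deepeval version:

```python
from deepeval import evaluate
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn

convo = ConversationalTestCase(
    scenario="User was double-billed and wants their subscription cancelled.",
    turns=[
        Turn(role="user", content="I was charged twice. Cancel my plan."),
        Turn(role="assistant", content="Done. Your plan is cancelled and the "
             "duplicate charge will be refunded."),
    ],
)

# The whole conversation is the unit of evaluation.
evaluate(
    test_cases=[convo],
    metrics=[ConversationalGEval(
        name="Resolution",
        criteria="Determine whether the assistant fully resolved the user's issue.",
    )],
)
```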
What's Next
- Walk through a single-turn end-to-end evaluation.
- Walk through a multi-turn end-to-end evaluation using the `ConversationSimulator`.
- Run end-to-end evals in CI/CD pipelines using `assert_test()` and `deepeval test run`.
- Compare with component-level evaluation if your app has internal structure worth grading.
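A minimal CI-friendly sketch (the file name and question are illustrative); `deepeval test run` executes this through pytest and fails the pipeline if a metric drops below its threshold:

```python
# test_app.py -- run with: deepeval test run test_app.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_question():
    test_case = LLMTestCase(
        input="What is your refund window?",
        actual_output="Refunds are available for 30 days.",  # call your app here
    )
    # Raises (and fails the test) if any metric scores below its threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```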