Part of #380
Depends on: #385
What
Dashboard showing historical evaluation runs for a given scenario — with detail view, run comparison, and regression detection.
- List of past eval runs: timestamp, version, pass/fail, key metrics summary
- Run detail view: per-metric breakdown, inputs/outputs sample
- Comparison view: select two runs side-by-side, highlight metric deltas
- Regression indicator: flag runs where metrics regressed beyond a threshold vs prior run
- Data sourced from Scouter via OpsML proxy
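
The comparison deltas and the regression indicator could be computed roughly as follows. This is only a sketch under stated assumptions: the `EvalRun` shape, the function names, the relative 5% threshold, and the `lower_is_better` set are all hypothetical and are not taken from the Scouter or OpsML APIs. The actual threshold semantics (absolute vs. relative, per-metric vs. global) still need to be decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical run record; field names are assumptions, not the Scouter schema."""
    run_id: str
    timestamp: str  # ISO 8601 in a single timezone, so string order matches time order
    metrics: dict[str, float]


def metric_deltas(prior: EvalRun, current: EvalRun) -> dict[str, float]:
    """Delta (current - prior) for every metric present in both runs."""
    shared = prior.metrics.keys() & current.metrics.keys()
    return {name: current.metrics[name] - prior.metrics[name] for name in sorted(shared)}


def regressed_metrics(
    prior: EvalRun,
    current: EvalRun,
    threshold: float = 0.05,
    lower_is_better: frozenset[str] = frozenset(),
) -> list[str]:
    """Metrics that got worse by more than `threshold`, relative to the prior value."""
    flagged = []
    for name, delta in metric_deltas(prior, current).items():
        baseline = abs(prior.metrics[name])
        if baseline == 0:
            continue  # relative change is undefined; skipped in this sketch
        # Positive `worsening` means the metric moved in the bad direction.
        worsening = delta if name in lower_is_better else -delta
        if worsening / baseline > threshold:
            flagged.append(name)
    return flagged


def flag_regressions(
    runs: list[EvalRun],
    threshold: float = 0.05,
    lower_is_better: frozenset[str] = frozenset(),
) -> dict[str, list[str]]:
    """Compare each run with the run immediately before it, keyed by run_id."""
    ordered = sorted(runs, key=lambda r: r.timestamp)
    return {
        current.run_id: regressed_metrics(prior, current, threshold, lower_is_better)
        for prior, current in zip(ordered, ordered[1:])
    }


# Illustrative data only; metric names and values are made up.
runs = [
    EvalRun("run-1", "2024-01-01T00:00:00Z", {"accuracy": 0.90, "latency_ms": 120.0}),
    EvalRun("run-2", "2024-01-02T00:00:00Z", {"accuracy": 0.80, "latency_ms": 118.0}),
    EvalRun("run-3", "2024-01-03T00:00:00Z", {"accuracy": 0.81, "latency_ms": 150.0}),
]
flags = flag_regressions(runs, lower_is_better=frozenset({"latency_ms"}))
deltas = metric_deltas(runs[0], runs[1])
```

A non-empty list in `flags` would drive the regression indicator in the run list, while `metric_deltas` would feed the highlighted values in the side-by-side comparison view. The oldest run has no prior run, so it never appears in `flags`.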