PLN-544: Add 128K context budget awareness to judges #93
- Add CLOSEDLOOP_CONTEXT_LIMIT forwarding in run-loop.sh so subagents and skills can detect reduced context windows
- Make context-manager-for-judges compute token budget dynamically from CLOSEDLOOP_CONTEXT_LIMIT instead of fixed 30K ceiling
- Add Step 0.5 pre-execution context budget check in run-judges SKILL.md to skip judges when prompt exceeds 128K budget
- Add is_skipped() method and SKIP_SENTINEL to validate_judge_report so skipped judges (final_status=3 with "Skipped:" justification) pass validation with empty metrics
- Add comprehensive tests for skipped judge tolerance, all-skipped reports, and mixed skipped/normal reports
- Sanitize CLOSEDLOOP_COMMAND env in test_run_loop_failure_marker to prevent ambient shell state from flipping test assertions
- Add venvPath/venv to pyproject.toml for Pyright venv resolution

Testing: pytest on new and existing tests in judges and code plugins

Risks: None identified; backward compatible when CLOSEDLOOP_CONTEXT_LIMIT is unset

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| } | ||
| ``` | ||
| The justification string MUST be exactly `"Skipped: artifact exceeded context budget after compression"` — do not include token counts or other dynamic text. This fixed string allows downstream consumers to distinguish skipped judges from runtime failures. |
This says justification MUST be exactly the fixed string, no dynamic text. But line ~1331 in the recovery section says "Justification must state the skip reason explicitly: 128K mode is active, estimated tokens, and available budget." An LLM executing this skill will get two contradictory instructions. Pick one.
| } | ||
| ``` | ||
| Substitute `{judge-name}` with the actual agent identifier (e.g., `dry-judge`, `ssot-judge`), `{primary_metric_name}` with the judge's primary metric name (e.g., `dry_score`, `ssot_score`), and the token counts with the computed values. |
Sub-step D says "Substitute {judge-name} with the actual agent identifier... and the token counts with the computed values." But the template has no token count placeholders. The justification string is hardcoded with no {token} fields. This reads like a stale edit that forgot to remove the token-count substitution instruction after switching to a fixed string.
| assert valid is True, f"Expected valid report with skipped judge via metric, got: {message}" | ||
| def test_multiple_skipped_judges_pass_validation(self, tmp_path: Path) -> None: | ||
| """Report with multiple skipped judges (all have case_ids) passes validation.""" |
The SKILL.md says "either all judges fit or none fit for a given run." So the real 128K scenario is all 16 skipped, zero normal. This test only skips 3 of 16. Add a test where every judge in the report is a skipped CaseScore with empty metrics.
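A rough sketch of the requested all-skipped test follows. The `CaseScore` class and `validate_report()` function here are simplified, hypothetical stand-ins written only for illustration; the real definitions in `validate_judge_report.py` are not shown in this thread, and the prefix-based `is_skipped()` check and the value of `SKIP_SENTINEL` are assumptions.

```python
from dataclasses import dataclass, field

# Fixed justification string quoted in SKILL.md.
SKIP_JUSTIFICATION = "Skipped: artifact exceeded context budget after compression"
# Assumed value of the real SKIP_SENTINEL constant.
SKIP_SENTINEL = "Skipped:"


@dataclass
class CaseScore:
    """Hypothetical stand-in for the CaseScore model in validate_judge_report.py."""

    case_id: str
    final_status: int
    justification: str = ""
    metrics: list = field(default_factory=list)

    def is_skipped(self) -> bool:
        # Assumption: skipped means final_status=3 plus a "Skipped:" justification.
        return self.final_status == 3 and self.justification.startswith(SKIP_SENTINEL)


def validate_report(scores):
    """Toy validator: empty metrics are only tolerated for skipped judges."""
    for score in scores:
        if not score.metrics and not score.is_skipped():
            return False, f"{score.case_id}: empty metrics"
    return True, "ok"


def test_all_judges_skipped_pass_validation():
    # The 128K scenario described above: all 16 judges skipped, zero normal.
    scores = [CaseScore(f"judge-{i}", 3, SKIP_JUSTIFICATION) for i in range(16)]
    valid, message = validate_report(scores)
    assert valid is True, f"Expected valid all-skipped report, got: {message}"
```

In the real test the report would presumably be built with the fixtures already used by `test_multiple_skipped_judges_pass_validation` rather than these stand-ins.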
| "name": "judges", | ||
| "description": "LLM Judges plugin", | ||
| "version": "1.6.1", | ||
| "version": "1.8.3", |
Main is at 1.7.1, this PR sets 1.8.3 from a base of 1.6.1. Needs a rebase. Also 1.8.3 is a PATCH, but per CLAUDE.md this is a feature addition (new skip logic in run-judges), so it should be a MINOR bump like 1.8.0.
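To make the expected number concrete, here is a minimal sketch of a conventional MAJOR.MINOR.PATCH bump. The `bump()` helper is hypothetical and assumes ordinary semantic-versioning rules; the exact policy in CLAUDE.md is not quoted here.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping the given part (hypothetical helper)."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        # A feature addition resets the patch component to zero.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")


# After rebasing onto main (1.7.1), a feature addition is a minor bump.
print(bump("1.7.1", "minor"))
```

Under these assumptions the rebased judges plugin version would be 1.8.0, matching the suggestion above.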
| "name": "code", | ||
| "description": "Code and planning framework plugin", | ||
| "version": "1.11.16", | ||
| "version": "1.12.0", |
Main is at 1.11.20. This sets 1.12.0 from a stale base of 1.11.16. This will either merge-conflict or silently downgrade the version. Needs a rebase.
Summary
- When `CLOSEDLOOP_CONTEXT_LIMIT` is set and indicates a context window smaller than 200K tokens, the context manager dynamically computes a reduced token budget and sets `context_128k_mode=true`. The `run-judges` skill reads this flag and skips all judges whose estimated prompt would exceed the available budget, recording them as `final_status=3` CaseScores with a fixed `"Skipped:"` justification.
- The context manager's token budget is now computed as `min(30000, CLOSEDLOOP_CONTEXT_LIMIT - 98000)`, falling back to 30K when the env var is unset. Budget tables now use percentage-based formulas instead of fixed token values.
- `CLOSEDLOOP_CONTEXT_LIMIT` forwarding in `run-loop.sh`: The orchestration loop now exports `CLOSEDLOOP_CONTEXT_LIMIT` (when set by the caller) so downstream subagents and skills can detect the reduced context window.
- `validate_judge_report.py` gains a `SKIP_SENTINEL` constant and `CaseScore.is_skipped()` method so that skipped judges with empty metrics still pass report validation.
- Sanitized the `CLOSEDLOOP_COMMAND`/`LAST_CLAUDE_COMMAND` env vars in `test_run_loop_failure_marker.py` to prevent ambient shell state from flipping assertions.
- Added `venvPath`/`venv` to `pyproject.toml` for correct virtual-env resolution.

Changed Files
- `plugins/code/scripts/run-loop.sh`: forwards `CLOSEDLOOP_CONTEXT_LIMIT` to subagents
- `plugins/code/tools/python/test_run_loop_failure_marker.py`
- `plugins/code/.claude-plugin/plugin.json`
- `plugins/judges/agents/context-manager-for-judges.md`
- `plugins/judges/skills/run-judges/SKILL.md`
- `plugins/judges/skills/run-judges/scripts/validate_judge_report.py`: `SKIP_SENTINEL`, `is_skipped()`, relaxed empty-metrics validation
- `plugins/judges/skills/run-judges/scripts/test_validate_judge_report.py`
- `plugins/judges/.claude-plugin/plugin.json`
- `pyproject.toml`: `venvPath`/`venv` for Pyright

Test Plan
- `pytest` tests pass
- `TestSkippedJudges` class covers `is_skipped()` for top-level and metric-level justification, non-skipped statuses, and edge cases
- Reports containing skipped judges pass `validate_report()`
- `test_run_loop_failure_marker.py` env sanitization prevents false positives from ambient shell state

Loop ID: 019e2c30-0f0f-7608-a6d1-a2b233433ade
Artifact: https://app.closedloop.ai/implementation-plans/PLN-544
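
The budget formula from the Summary, `min(30000, CLOSEDLOOP_CONTEXT_LIMIT - 98000)` with a 30K fallback, can be sketched as below. This is an illustrative approximation, not the context manager's actual implementation: the function name and constant names are hypothetical, and the sketch does not clamp negative results because the Summary does not say how very small limits are handled.

```python
import os

DEFAULT_BUDGET = 30_000    # 30K ceiling and fallback stated in the Summary
RESERVED_TOKENS = 98_000   # hypothetical name for the 98000 offset in the formula
FULL_CONTEXT = 200_000     # windows smaller than this enable 128K mode


def judge_token_budget(env=None):
    """Return (token_budget, context_128k_mode) for the given environment mapping."""
    env = os.environ if env is None else env
    raw = env.get("CLOSEDLOOP_CONTEXT_LIMIT")
    if not raw:
        # Env var unset: keep the previous fixed 30K budget.
        return DEFAULT_BUDGET, False
    limit = int(raw)
    budget = min(DEFAULT_BUDGET, limit - RESERVED_TOKENS)
    return budget, limit < FULL_CONTEXT


print(judge_token_budget({"CLOSEDLOOP_CONTEXT_LIMIT": "120000"}))
```

Under this reading, a limit of 128000 still yields the full 30000-token budget while enabling 128K mode, and only limits below 128000 shrink the budget.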