PLN-544: Add 128K context budget awareness to judges by mikeangstadt · Pull Request #93 · closedloop-ai/claude-plugins

mikeangstadt · 2026-05-15T16:40:52Z

Summary

- 128K context budget awareness for judges: When `CLOSEDLOOP_CONTEXT_LIMIT` is set and indicates a context window smaller than 200K tokens, the context manager dynamically computes a reduced token budget and sets `context_128k_mode=true`. The run-judges skill reads this flag and skips all judges whose estimated prompt would exceed the available budget, recording them as `final_status=3` CaseScores with a fixed "Skipped:" justification.
- Dynamic token budget in context-manager-for-judges: Replaces the hardcoded 30K token ceiling with `min(30000, CLOSEDLOOP_CONTEXT_LIMIT - 98000)`, falling back to 30K when the env var is unset. Budget tables now use percentage-based formulas instead of fixed token values.
- `CLOSEDLOOP_CONTEXT_LIMIT` forwarding in run-loop.sh: The orchestration loop now exports `CLOSEDLOOP_CONTEXT_LIMIT` (when set by the caller) so downstream subagents and skills can detect the reduced context window.
- Skipped-judge validation support: validate_judge_report.py gains a `SKIP_SENTINEL` constant and a `CaseScore.is_skipped()` method so that skipped judges with empty metrics still pass report validation.
- Test hardening: 180+ lines of new tests covering skipped-judge tolerance (top-level and metric-level justification, all-skipped reports, mixed reports). Also sanitizes the `CLOSEDLOOP_COMMAND`/`LAST_CLAUDE_COMMAND` env vars in test_run_loop_failure_marker.py to prevent ambient shell state from flipping assertions.
- Pyright config: Adds `venvPath`/`venv` to pyproject.toml for correct virtual-env resolution.
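
For illustration only, the budget formula above can be sketched in Python. The real logic lives in the agent instructions in context-manager-for-judges.md, not in a Python module; `compute_judge_budget` and `JudgeBudget` are hypothetical names, and both the clamp at zero and the handling of unparseable values are assumptions that the PR description does not state.

```python
import os
from typing import Mapping, NamedTuple, Optional

DEFAULT_BUDGET = 30_000        # the previous hardcoded ceiling
RESERVED_TOKENS = 98_000       # amount subtracted from the limit in the formula
FULL_CONTEXT_TOKENS = 200_000  # limits below this enable 128K mode


class JudgeBudget(NamedTuple):
    token_budget: int
    context_128k_mode: bool


def compute_judge_budget(env: Optional[Mapping[str, str]] = None) -> JudgeBudget:
    """Return the token budget and 128K-mode flag for a given environment."""
    env = os.environ if env is None else env
    raw = env.get("CLOSEDLOOP_CONTEXT_LIMIT", "").strip()
    if not raw:
        # Env var unset: fall back to the 30K ceiling, 128K mode off.
        return JudgeBudget(DEFAULT_BUDGET, False)
    try:
        limit = int(raw)
    except ValueError:
        # Assumption: an unparseable value is treated like an unset one.
        return JudgeBudget(DEFAULT_BUDGET, False)
    budget = min(DEFAULT_BUDGET, limit - RESERVED_TOKENS)
    # Assumption: clamp at zero so a very small limit never yields a negative budget.
    return JudgeBudget(max(budget, 0), limit < FULL_CONTEXT_TOKENS)
```

With a limit of exactly 128000 the formula still yields 30000 (128000 - 98000), so the budget only shrinks for limits below 128K; the flag, however, flips for any limit under 200K.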

Changed Files

| File | Change |
| --- | --- |
| `plugins/code/scripts/run-loop.sh` | Forward `CLOSEDLOOP_CONTEXT_LIMIT` to subagents |
| `plugins/code/tools/python/test_run_loop_failure_marker.py` | Sanitize env vars leaking into tests |
| `plugins/code/.claude-plugin/plugin.json` | Version bump → 1.12.0 |
| `plugins/judges/agents/context-manager-for-judges.md` | Dynamic token budget, 128K mode detection |
| `plugins/judges/skills/run-judges/SKILL.md` | Step 0.5 budget check, skip logic, error CaseScores |
| `plugins/judges/skills/run-judges/scripts/validate_judge_report.py` | `SKIP_SENTINEL`, `is_skipped()`, relaxed empty-metrics validation |
| `plugins/judges/skills/run-judges/scripts/test_validate_judge_report.py` | 180+ lines of skipped-judge tests |
| `plugins/judges/.claude-plugin/plugin.json` | Version bump → 1.8.3 |
| `pyproject.toml` | Add `venvPath`/`venv` for Pyright |

Test Plan

- All existing pytest tests pass
- New `TestSkippedJudges` class covers `is_skipped()` for top-level and metric-level justification, non-skipped statuses, and edge cases
- Skipped-judge reports (single, all-skipped, mixed) pass `validate_report()`
- test_run_loop_failure_marker.py env sanitization prevents false positives from ambient shell state
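
As a rough sketch of the env sanitization idea (not the code in test_run_loop_failure_marker.py, which is not shown in this PR description), a test could build a copy of the environment with the leaking variables removed and pass it to the subprocess under test. The helper name `sanitized_env` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Variables named in the PR description as leaking from the ambient shell.
LEAKY_VARS = ("CLOSEDLOOP_COMMAND", "LAST_CLAUDE_COMMAND")


def sanitized_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment without the leaking variables."""
    source = os.environ if base is None else base
    return {key: value for key, value in source.items() if key not in LEAKY_VARS}
```

A test would then call something like `subprocess.run([...], env=sanitized_env())`, so the result no longer depends on what the developer's shell happens to export.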

Loop ID: 019e2c30-0f0f-7608-a6d1-a2b233433ade
Artifact: https://app.closedloop.ai/implementation-plans/PLN-544

- Add CLOSEDLOOP_CONTEXT_LIMIT forwarding in run-loop.sh so subagents and skills can detect reduced context windows
- Make context-manager-for-judges compute token budget dynamically from CLOSEDLOOP_CONTEXT_LIMIT instead of fixed 30K ceiling
- Add Step 0.5 pre-execution context budget check in run-judges SKILL.md to skip judges when prompt exceeds 128K budget
- Add is_skipped() method and SKIP_SENTINEL to validate_judge_report so skipped judges (final_status=3 with "Skipped:" justification) pass validation with empty metrics
- Add comprehensive tests for skipped judge tolerance, all-skipped reports, and mixed skipped/normal reports
- Sanitize CLOSEDLOOP_COMMAND env in test_run_loop_failure_marker to prevent ambient shell state from flipping test assertions
- Add venvPath/venv to pyproject.toml for Pyright venv resolution

Testing: pytest on new and existing tests in judges and code plugins

Risks: None identified; backward compatible when CLOSEDLOOP_CONTEXT_LIMIT is unset

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

thadeusb · 2026-05-15T17:17:21Z

+}
+```
+
+The justification string MUST be exactly `"Skipped: artifact exceeded context budget after compression"` — do not include token counts or other dynamic text. This fixed string allows downstream consumers to distinguish skipped judges from runtime failures.


This says justification MUST be exactly the fixed string, no dynamic text. But line ~1331 in the recovery section says "Justification must state the skip reason explicitly: 128K mode is active, estimated tokens, and available budget." An LLM executing this skill will get two contradictory instructions. Pick one.
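
To make the stakes concrete, here is a minimal sketch of how a downstream consumer might tell a skipped judge from a runtime failure when both carry `final_status=3`. This is not the code from validate_judge_report.py: the value of `SKIP_SENTINEL`, the field names, and the `Metric` type are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

SKIP_SENTINEL = "Skipped:"  # assumed value; the PR only names the constant
SKIPPED_STATUS = 3          # final_status used for skipped judges per the PR


@dataclass
class Metric:
    metric_name: str
    justification: str = ""


@dataclass
class CaseScore:
    case_id: str
    final_status: int
    justification: str = ""
    metrics: List[Metric] = field(default_factory=list)

    def is_skipped(self) -> bool:
        """True when the status is 3 and a justification starts with the sentinel."""
        if self.final_status != SKIPPED_STATUS:
            return False
        if self.justification.startswith(SKIP_SENTINEL):
            return True
        # Metric-level justification is also accepted, as the PR's tests describe.
        return any(m.justification.startswith(SKIP_SENTINEL) for m in self.metrics)
```

A prefix check like this tolerates dynamic text after "Skipped:", whereas an exact-match check would not, which is one more reason the two instructions need to agree.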

thadeusb · 2026-05-15T17:17:24Z

+}
+```
+
+Substitute `{judge-name}` with the actual agent identifier (e.g., `dry-judge`, `ssot-judge`), `{primary_metric_name}` with the judge's primary metric name (e.g., `dry_score`, `ssot_score`), and the token counts with the computed values.


Sub-step D says "Substitute {judge-name} with the actual agent identifier... and the token counts with the computed values." But the template has no token count placeholders. The justification string is hardcoded with no {token} fields. This reads like a stale edit that forgot to remove the token-count substitution instruction after switching to a fixed string.

thadeusb · 2026-05-15T17:17:28Z

+        assert valid is True, f"Expected valid report with skipped judge via metric, got: {message}"
+
+    def test_multiple_skipped_judges_pass_validation(self, tmp_path: Path) -> None:
+        """Report with multiple skipped judges (all have case_ids) passes validation."""


The SKILL.md says "either all judges fit or none fit for a given run." So the real 128K scenario is all 16 skipped, zero normal. This test only skips 3 of 16. Add a test where every judge in the report is a skipped CaseScore with empty metrics.
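
Something along these lines would cover it. This is a self-contained sketch, so `validate_report_stub` stands in for the real `validate_report()` (whose signature is not visible in this thread), and the dict layout and `judge-{i}` identifiers are placeholders rather than the actual report schema.

```python
from typing import Any, Dict, List, Tuple

SKIP_JUSTIFICATION = "Skipped: artifact exceeded context budget after compression"


def make_skipped_case(case_id: str) -> Dict[str, Any]:
    """Build a skipped CaseScore-like dict with empty metrics (assumed layout)."""
    return {
        "case_id": case_id,
        "final_status": 3,
        "justification": SKIP_JUSTIFICATION,
        "metrics": [],
    }


def validate_report_stub(cases: List[Dict[str, Any]]) -> Tuple[bool, str]:
    """Stand-in validator: empty metrics are allowed only on skipped cases."""
    for case in cases:
        skipped = case["final_status"] == 3 and case["justification"].startswith("Skipped:")
        if not case["metrics"] and not skipped:
            return False, f"{case['case_id']}: empty metrics on a non-skipped case"
    return True, "ok"


def test_all_judges_skipped_passes_validation() -> None:
    """Report where all 16 judges are skipped with empty metrics."""
    cases = [make_skipped_case(f"judge-{i}") for i in range(16)]
    valid, message = validate_report_stub(cases)
    assert valid is True, f"Expected valid all-skipped report, got: {message}"
```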

thadeusb · 2026-05-15T17:17:32Z

  "name": "judges",
  "description": "LLM Judges plugin",
-  "version": "1.6.1",
+  "version": "1.8.3",


Main is at 1.7.1, this PR sets 1.8.3 from a base of 1.6.1. Needs a rebase. Also 1.8.3 is a PATCH, but per CLAUDE.md this is a feature addition (new skip logic in run-judges), so it should be a MINOR bump like 1.8.0.
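
For reference, a minimal sketch of what a MINOR bump from the current main would produce, assuming plain `MAJOR.MINOR.PATCH` version strings with no pre-release suffix. `parse_version` and `bump_minor` are illustrative helpers, not part of this repo.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_minor(version: str) -> str:
    """Increment MINOR and reset PATCH to zero, per semantic versioning."""
    major, minor, _patch = parse_version(version)
    return f"{major}.{minor + 1}.0"
```

Under that assumption, `bump_minor("1.7.1")` gives `"1.8.0"`.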

thadeusb · 2026-05-15T17:17:34Z

  "name": "code",
  "description": "Code and planning framework plugin",
-  "version": "1.11.16",
+  "version": "1.12.0",


Main is at 1.11.20. This sets 1.12.0 from a stale base of 1.11.16. This will either merge-conflict or silently downgrade the version. Needs a rebase.

mikeangstadt added the symphony label May 15, 2026

thadeusb reviewed May 15, 2026

mikeangstadt wants to merge 1 commit into main from symphony/pln-544

2 participants
