PLN-544: Add 128K context budget awareness to judges #93

Open
mikeangstadt wants to merge 1 commit into main from symphony/pln-544

@mikeangstadt

Summary

  • 128K context budget awareness for judges: When CLOSEDLOOP_CONTEXT_LIMIT is set and indicates a context window smaller than 200K tokens, the context manager dynamically computes a reduced token budget and sets context_128k_mode=true. The run-judges skill reads this flag and skips all judges whose estimated prompt would exceed the available budget, recording them as final_status=3 CaseScores with a fixed "Skipped:" justification.
  • Dynamic token budget in context-manager-for-judges: Replaces the hardcoded 30K token ceiling with min(30000, CLOSEDLOOP_CONTEXT_LIMIT - 98000), falling back to 30K when the env var is unset. Budget tables now use percentage-based formulas instead of fixed token values.
  • CLOSEDLOOP_CONTEXT_LIMIT forwarding in run-loop.sh: The orchestration loop now exports CLOSEDLOOP_CONTEXT_LIMIT (when set by the caller) so downstream subagents and skills can detect the reduced context window.
  • Skipped-judge validation support: validate_judge_report.py gains a SKIP_SENTINEL constant and CaseScore.is_skipped() method so that skipped judges with empty metrics still pass report validation.
  • Test hardening: 180+ lines of new tests covering skipped-judge tolerance (top-level and metric-level justification, all-skipped reports, mixed reports). Also sanitizes CLOSEDLOOP_COMMAND/LAST_CLAUDE_COMMAND env vars in test_run_loop_failure_marker.py to prevent ambient shell state from flipping assertions.
  • Pyright config: Adds venvPath/venv to pyproject.toml for correct virtual-env resolution.
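
The budget rule described in the summary can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the agent's actual logic (which lives in a Markdown prompt). The function name `compute_judge_budget` and the constant names are hypothetical. The PR does not say how non-numeric values or limits below 98K are handled, so the sketch leaves those cases unhandled.

```python
import os

DEFAULT_BUDGET = 30_000    # the previously hardcoded ceiling
RESERVED_TOKENS = 98_000   # amount subtracted from the context limit
FULL_CONTEXT = 200_000     # limits below this turn on 128K mode


def compute_judge_budget(env=None):
    """Return (token_budget, context_128k_mode) per the rule in the summary.

    `env` defaults to os.environ; pass a dict to test without touching
    the real environment.
    """
    env = os.environ if env is None else env
    raw = env.get("CLOSEDLOOP_CONTEXT_LIMIT")
    if not raw:
        # Env var unset: fall back to the 30K ceiling, normal mode.
        return DEFAULT_BUDGET, False
    limit = int(raw)
    budget = min(DEFAULT_BUDGET, limit - RESERVED_TOKENS)
    return budget, limit < FULL_CONTEXT
```

Note that under this formula a 128,000-token limit still yields the full 30,000-token budget; the budget only shrinks for limits below 128,000.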

Changed Files

| File | Change |
| --- | --- |
| plugins/code/scripts/run-loop.sh | Forward CLOSEDLOOP_CONTEXT_LIMIT to subagents |
| plugins/code/tools/python/test_run_loop_failure_marker.py | Sanitize env vars leaking into tests |
| plugins/code/.claude-plugin/plugin.json | Version bump → 1.12.0 |
| plugins/judges/agents/context-manager-for-judges.md | Dynamic token budget, 128K mode detection |
| plugins/judges/skills/run-judges/SKILL.md | Step 0.5 budget check, skip logic, error CaseScores |
| plugins/judges/skills/run-judges/scripts/validate_judge_report.py | SKIP_SENTINEL, is_skipped(), relaxed empty-metrics validation |
| plugins/judges/skills/run-judges/scripts/test_validate_judge_report.py | 180+ lines of skipped-judge tests |
| plugins/judges/.claude-plugin/plugin.json | Version bump → 1.8.3 |
| pyproject.toml | Add venvPath/venv for Pyright |

Test Plan

  • All existing pytest tests pass
  • New TestSkippedJudges class covers is_skipped() for top-level and metric-level justification, non-skipped statuses, and edge cases
  • Skipped-judge reports (single, all-skipped, mixed) pass validate_report()
  • test_run_loop_failure_marker.py env sanitization prevents false positives from ambient shell state
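
For readers without the diff open, the skip check these tests exercise might look roughly like the sketch below. The real `validate_judge_report.py` is not shown in this conversation, so the field names, the `Metric` class, and the value of `SKIP_SENTINEL` are assumptions; only the names `SKIP_SENTINEL`, `CaseScore`, and `is_skipped()` and the `final_status=3` convention come from the PR description.

```python
from dataclasses import dataclass, field

SKIP_SENTINEL = "Skipped:"  # assumed value, based on the "Skipped:" prefix
SKIPPED_STATUS = 3          # final_status used for skipped judges


@dataclass
class Metric:
    # Hypothetical metric shape, for illustration only.
    metric_name: str
    justification: str = ""


@dataclass
class CaseScore:
    case_id: str
    final_status: int
    justification: str = ""
    metrics: list = field(default_factory=list)

    def is_skipped(self) -> bool:
        """True when status is 3 and a top-level or metric-level
        justification starts with the skip sentinel."""
        if self.final_status != SKIPPED_STATUS:
            return False
        texts = [self.justification] + [m.justification for m in self.metrics]
        return any(t.startswith(SKIP_SENTINEL) for t in texts)
```

A validator built on this could then accept empty `metrics` only when `is_skipped()` returns true, which is the relaxation the PR describes.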

Loop ID: 019e2c30-0f0f-7608-a6d1-a2b233433ade
Artifact: https://app.closedloop.ai/implementation-plans/PLN-544

- Add CLOSEDLOOP_CONTEXT_LIMIT forwarding in run-loop.sh so
  subagents and skills can detect reduced context windows
- Make context-manager-for-judges compute token budget dynamically
  from CLOSEDLOOP_CONTEXT_LIMIT instead of fixed 30K ceiling
- Add Step 0.5 pre-execution context budget check in run-judges
  SKILL.md to skip judges when prompt exceeds 128K budget
- Add is_skipped() method and SKIP_SENTINEL to validate_judge_report
  so skipped judges (final_status=3 with "Skipped:" justification)
  pass validation with empty metrics
- Add comprehensive tests for skipped judge tolerance, all-skipped
  reports, and mixed skipped/normal reports
- Sanitize CLOSEDLOOP_COMMAND env in test_run_loop_failure_marker
  to prevent ambient shell state from flipping test assertions
- Add venvPath/venv to pyproject.toml for Pyright venv resolution

Testing: pytest on new and existing tests in judges and code plugins
Risks: None identified — backward compatible when CLOSEDLOOP_CONTEXT_LIMIT is unset

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
}
```

The justification string MUST be exactly `"Skipped: artifact exceeded context budget after compression"` — do not include token counts or other dynamic text. This fixed string allows downstream consumers to distinguish skipped judges from runtime failures.

This says justification MUST be exactly the fixed string, no dynamic text. But line ~1331 in the recovery section says "Justification must state the skip reason explicitly: 128K mode is active, estimated tokens, and available budget." An LLM executing this skill will get two contradictory instructions. Pick one.

}
```

Substitute `{judge-name}` with the actual agent identifier (e.g., `dry-judge`, `ssot-judge`), `{primary_metric_name}` with the judge's primary metric name (e.g., `dry_score`, `ssot_score`), and the token counts with the computed values.

Sub-step D says "Substitute {judge-name} with the actual agent identifier... and the token counts with the computed values." But the template has no token count placeholders. The justification string is hardcoded with no {token} fields. This reads like a stale edit that forgot to remove the token-count substitution instruction after switching to a fixed string.
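
If the fixed string wins, the substitution step only needs the two name placeholders. A rough sketch of what that could look like, assuming a dictionary layout that is not taken from the actual template (only `final_status`, the placeholder names, and the justification string come from the quoted SKILL.md text):

```python
SKIP_JUSTIFICATION = "Skipped: artifact exceeded context budget after compression"


def build_skipped_case_score(judge_name: str, primary_metric_name: str) -> dict:
    """Fill in the two name placeholders; the justification is never templated."""
    return {
        "case_id": judge_name,          # hypothetical field layout
        "final_status": 3,
        "metrics": [
            {
                "metric_name": primary_metric_name,
                "justification": SKIP_JUSTIFICATION,
            }
        ],
    }


def is_budget_skip(record: dict) -> bool:
    """Exact match on the fixed string separates skips from runtime failures."""
    return record.get("final_status") == 3 and any(
        m.get("justification") == SKIP_JUSTIFICATION
        for m in record.get("metrics", [])
    )
```

An exact-match check like `is_budget_skip` is only reliable if no token counts are ever appended to the string, which is why the two instructions cannot both stand.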

        assert valid is True, f"Expected valid report with skipped judge via metric, got: {message}"

    def test_multiple_skipped_judges_pass_validation(self, tmp_path: Path) -> None:
        """Report with multiple skipped judges (all have case_ids) passes validation."""

The SKILL.md says "either all judges fit or none fit for a given run." So the real 128K scenario is all 16 skipped, zero normal. This test only skips 3 of 16. Add a test where every judge in the report is a skipped CaseScore with empty metrics.

  "name": "judges",
  "description": "LLM Judges plugin",
- "version": "1.6.1",
+ "version": "1.8.3",

Main is at 1.7.1, but this PR sets 1.8.3 from a base of 1.6.1, so it needs a rebase. Also, 1.8.3 carries a PATCH component, but per CLAUDE.md this is a feature addition (new skip logic in run-judges), so it should be a clean MINOR bump like 1.8.0.
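
To make the expected numbers concrete, here is a small sketch of a bump calculator. It assumes plain `MAJOR.MINOR.PATCH` semantic versioning; CLAUDE.md's actual policy is not quoted in this thread, and the helper names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def next_version(current: str, kind: str) -> str:
    """Return the version that follows `current` for the given bump kind."""
    major, minor, patch = parse_version(current)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"   # minor bumps reset patch to 0
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump kind: {kind}")
```

With main at 1.7.1, `next_version("1.7.1", "minor")` gives `1.8.0`, matching the suggestion above.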

  "name": "code",
  "description": "Code and planning framework plugin",
- "version": "1.11.16",
+ "version": "1.12.0",

Main is at 1.11.20. This sets 1.12.0 from a stale base of 1.11.16. This will either merge-conflict or silently downgrade the version. Needs a rebase.
