SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills the way you train neural networks, with epochs, (mini-)batch sizes, learning rates, and validation gates, but without touching model weights.

📖 For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the Documentation & Reproduction Guide — view it rendered online or via GitHub Pages.

News 🔥🔥🔥

[2026-06-08] 😴 SkillOpt-Sleep is here — plugins for Claude Code, Codex, and Copilot. Give your local coding agent a nightly sleep cycle: it reviews your past sessions offline, replays your recurring tasks, and consolidates validated long-term memory + skills behind a held-out gate, so it gets better the more you use it. Validated on the public gbrain-evals skillopt-v1 benchmark with real Claude and Codex (deficient skills 0.00 → 1.00 on held-out, all 4 seeds). It's an open-source tool decoupled from the paper code. See plugins/ and the SkillOpt-Sleep section below.
[2026-06-03] 🎉 gbrain, gbrain-evals, and darwin-skill have all integrated SkillOpt.
[2026-06-02] 🎉 SkillOpt v0.1.0 is now available on PyPI! Install with pip install skillopt. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and a WebUI dashboard.

Overview

Modern agent skills are usually hand-crafted, generated one-shot by a strong LLM, or evolved through loosely controlled self-revision — none of which behaves like a deep-learning optimizer for the skill itself, and none of which reliably improves over its starting point under feedback.

SkillOpt treats the skill document as the trainable state of a frozen agent, and trains it with the discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into bounded add / delete / replace edits on a single skill document; a candidate edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow / meta update make skill training stable while adding zero inference-time model calls at deployment.
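The accept-only-on-strict-improvement rule can be illustrated with a minimal sketch. This is not the SkillOpt implementation: the `Edit` type, `apply_edit`, `gated_step`, and the toy validator are hypothetical names, the skill is modeled as a list of lines, and the learning-rate budget is assumed here to mean "at most `budget` accepted edits per step".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Edit:
    """Hypothetical bounded edit on a skill stored as a list of lines."""
    op: str          # "add", "delete", or "replace"
    index: int       # line position the edit targets
    text: str = ""   # new text for "add" / "replace"


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new list with the edit applied; the input is not mutated."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def gated_step(
    skill: List[str],
    candidates: Iterable[Edit],
    validate: Callable[[List[str]], float],
    budget: int,
    rejected: Set[Edit],
) -> Tuple[List[str], float]:
    """Accept a candidate only if it strictly improves the validation score.

    Assumption: `budget` caps the number of accepted edits in one step,
    and `rejected` remembers failed edits so they are not retried.
    """
    best_score = validate(skill)
    accepted = 0
    for edit in candidates:
        if accepted >= budget:
            break
        if edit in rejected:
            continue
        trial = apply_edit(skill, edit)
        score = validate(trial)
        if score > best_score:       # strict improvement, ties are rejected
            skill, best_score = trial, score
            accepted += 1
        else:
            rejected.add(edit)
    return skill, best_score


# Toy held-out validator: fraction of required instructions present.
REQUIRED = ("Cite the source.", "Show units.")


def toy_validate(lines: List[str]) -> float:
    return sum(r in lines for r in REQUIRED) / len(REQUIRED)


buffer: Set[Edit] = set()
proposals = [
    Edit("add", 1, "Cite the source."),
    Edit("replace", 0, "Answer."),     # no score change, so it is rejected
    Edit("add", 1, "Show units."),
]
new_skill, new_score = gated_step(
    ["Answer concisely."], proposals, toy_validate, budget=2, rejected=buffer
)
```

In this toy run the neutral `replace` edit lands in the rejected buffer because it only ties the current score, which is the behavior the strict gate is meant to enforce.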

The deployed artifact is a compact best_skill.md (typically 300–2,000 tokens) that runs against the unchanged target model. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex CLI, Claude Code CLI), SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells. On GPT-5.5, it lifts average accuracy over the no-skill baseline by +23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code. Optimized skill artifacts transfer across model scales, between Codex and Claude Code harnesses, and to nearby benchmarks without further optimization.

For the full method, ablations, and per-cell results see the paper; for a visual walkthrough of the loop see the project page; for deeper API / backend / benchmark docs see docs/.

🎬 Demo Video

▶ Watch the full demo on YouTube

😴 SkillOpt-Sleep — the deployment-time companion

SkillOpt (above) trains a skill offline on a benchmark. SkillOpt-Sleep applies the same discipline to your own daily usage: it gives a local coding agent a nightly sleep cycle that reviews your past sessions, replays your recurring tasks on your own API budget, and consolidates what it learns into validated long-term memory and skills — behind a held-out gate, staged for your review. The agent gets better the more you use it, with no weight training.

It synthesizes SkillOpt (validation-gated bounded text edits), Claude Dreams (offline consolidation; review-then-adopt), and the agent sleep idea (short-term experience → long-term competence). One "night":

harvest session transcripts → mine recurring tasks → replay offline
   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
   → stage proposal → (you) adopt
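
The "mine recurring tasks" stage can be pictured with a small sketch. The source does not say how recurrence is detected, so the frequency heuristic below is an assumption for illustration only, and `normalize` and `mine_recurring` are hypothetical names rather than part of skillopt_sleep.

```python
import re
from collections import Counter
from typing import Iterable, List


def normalize(request: str) -> str:
    """Lowercase, drop digits and punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", request.lower())
    return " ".join(text.split())


def mine_recurring(requests: Iterable[str], min_count: int = 2) -> List[str]:
    """Return normalized requests seen at least `min_count` times,
    most frequent first (a simple stand-in for task mining)."""
    counts = Counter(normalize(r) for r in requests)
    return [task for task, n in counts.most_common() if n >= min_count]


# Toy session history: three variants of one task plus a one-off request.
history = [
    "Fix flaky test #12",
    "fix flaky test #98!",
    "Write release notes",
    "Fix flaky test 7",
]
recurring = mine_recurring(history)
```

Only tasks that recur are worth replaying offline, since a one-off request gives the held-out gate nothing to validate against.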

Plugins for three agents (one engine, three thin shells — see plugins/):

Platform	Folder	Install
Claude Code	`plugins/claude-code`	`/plugin marketplace add ./plugins/claude-code` → `/sleep`
Codex	`plugins/codex`	`bash plugins/codex/install.sh` → `/sleep`
Copilot	`plugins/copilot`	register `plugins/copilot/mcp_server.py` as an MCP server

Validated on real models. On the public gbrain-evals skillopt-v1 benchmark, deficient skills go 0.00 → 1.00 on held-out sets with both Claude and Codex (all 4 seeds, including a real tool-use loop), cross-model transfer is positive, and the gate blocks regressions (full results).

Open-source tool, decoupled from the research. The engine lives in the top-level skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Controls — optional gate, multi-rollout contrastive reflection, token/time budget, multi-objective reward, user preferences, optimizer/target split — are documented in docs/sleep/CONTROLLABLE_DREAMING.md.

Deterministic proof (no API key): python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves.

Extensibility & WebUI

Adding a new backend

A backend = a chat / exec target (e.g. openai_chat, claude_chat, qwen_chat, minimax_chat, codex_exec, claude_code_exec). See docs/guide/new-backend.md for the full contract; in short you add a skillopt/model/<name>_backend.py module, register it in skillopt/model/common.py + backend_config.py, and wire it through the router in skillopt/model/__init__.py. qwen_backend.py and minimax_backend.py are good templates.

Adding a new benchmark

A benchmark = a skillopt/envs/<name>/ package with a dataloader.py, a rollout.py, and an initial.md seed skill. See docs/guide/new-benchmark.md for the full contract; the simplest reference is skillopt/envs/searchqa/.
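
Before wiring a new benchmark into training, it may help to check that its folder has the three files named above. The helper below is a hypothetical convenience, not part of SkillOpt; it assumes only the layout stated in this section, and the full contract in docs/guide/new-benchmark.md may require more (for example, specific functions inside each module).

```python
from pathlib import Path
from typing import List

# Files this README says a benchmark package contains.
REQUIRED_FILES = ("dataloader.py", "rollout.py", "initial.md")


def missing_benchmark_files(envs_root: Path, name: str) -> List[str]:
    """Return the required files absent from <envs_root>/<name>/."""
    bench_dir = envs_root / name
    return [f for f in REQUIRED_FILES if not (bench_dir / f).is_file()]


if __name__ == "__main__":
    # "my_benchmark" is a placeholder name for illustration.
    missing = missing_benchmark_files(Path("skillopt/envs"), "my_benchmark")
    print("missing files:", missing or "none")
```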

WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app

Flag	Default	Description
`--port`	7860	Server port
`--host`	`0.0.0.0`	Bind address
`--share`	off	Create a public Gradio share link

Citation

@misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills}, 
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}
}
