SynthLadder is a chemistry benchmark and evaluation package for
agentic language models on synthesis-focused tasks.
It is designed to measure performance at increasing planning depth, rather than to serve as a single mixed chemistry test set.
benchmark_v3 is organized as a three-level ladder:
- Level A: local reaction understanding
- Level B: single-step retrosynthesis
- Level C: route-level synthesis planning
The repository also includes an auxiliary grounding suite used for chemistry
role and condition extraction, but the primary paper-facing benchmark is the
A/B/C ladder.
The benchmark covers:
- organic chemistry reasoning
- retrosynthesis and route planning
- deterministic and rubric-based LLM evaluation
- benchmark/ - public benchmark artifacts, eval subsets, and manifests
- synthladder/agents/ - OpenShell runtime, tool registry, and SGR reasoning layer
- synthladder/evaluation/ - student stage, judge stage, and matrix runner
- synthladder/guards/ - typed validation and retry helpers for student and judge calls
- synthladder/build/ - reproducible benchmark construction utilities
- external_sources/ - source manifests, download links, and rebuild instructions
- docs/ - architecture, ladder semantics, judging, taxonomy, and runtime policy
The public repository intentionally excludes generated runs, scratch analysis, and legacy parsing pipelines that are not part of the final benchmark release.
Primary runtime entrypoint:
benchmark/benchmark_v3_eval.jsonl
Per-level public eval subsets:
benchmark/level_a_eval.jsonl
benchmark/level_b_eval.jsonl
benchmark/level_c_eval.jsonl
Supporting manifests:
benchmark/levels_manifest.yaml
benchmark/paper_eval_manifest.yaml
benchmark/LEVELS.md
Legacy compatibility/source layer retained for selected construction steps:
benchmark/benchmark_v1_0.jsonl
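As a quick sanity check on the files listed above, the sketch below counts rows in each eval subset. It assumes only that each `.jsonl` file holds one JSON object per line; the helper names `read_jsonl` and `count_rows` are illustrative and are not part of the `synthladder` package.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list of parsed JSON objects, one per non-empty line."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def count_rows(paths):
    """Map each existing path to its row count; paths that are absent are skipped."""
    return {str(p): len(read_jsonl(p)) for p in paths if Path(p).is_file()}


# Example call from the repository root (paths taken from the list above):
# count_rows([
#     "benchmark/level_a_eval.jsonl",
#     "benchmark/level_b_eval.jsonl",
#     "benchmark/level_c_eval.jsonl",
# ])
```

The row schema is deliberately left untouched here, since it is not described in this section.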
Requirements:
- Python 3.12+
- uv
- Docker for OpenShell-backed runs
- an OpenAI-compatible endpoint or local proxy
Setup:
git clone https://github.com/SergeiNikolenko/SynthLadder.git
cd SynthLadder
uv sync
The package is importable as synthladder.
Installed package entry points:
synthladder-student
synthladder-judge
synthladder-matrix
synthladder-materialize
synthladder-build
synthladder-build-levels
synthladder-build-paper-eval
Module entry points:
python -m synthladder.evaluation.student_validation
python -m synthladder.evaluation.llm_judge
python -m synthladder.evaluation.run_full_matrix
python -m synthladder.evaluation.materialize_benchmark_v3_eval
Export your model endpoint and API key:
export API_BASE_URL="http://127.0.0.1:8317/v1"
export CLIPROXY_API_KEY="<your-api-key>"
Quick health check:
curl -sS \
-H "Authorization: Bearer $CLIPROXY_API_KEY" \
"$API_BASE_URL/models"
If you run OpenShell-backed evaluation, make sure Docker and the OpenShell gateway are available.
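The same health check can be scripted in Python with only the standard library. This is a minimal sketch that mirrors the curl call: it assumes the endpoint answers `GET <base>/models` with a JSON body, as OpenAI-compatible servers usually do. The function names are illustrative and are not part of `synthladder`.

```python
import json
import urllib.request


def build_models_request(base_url, api_key):
    """Build a GET request for <base_url>/models carrying a bearer token."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_models(base_url, api_key, timeout=10.0):
    """Send the request and return the decoded JSON body."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call, reading the variables exported above:
# import os
# print(list_models(os.environ["API_BASE_URL"], os.environ["CLIPROXY_API_KEY"]))
```

An HTTP error or refused connection raises an exception from `urlopen`, which is usually enough to tell a bad key from an unreachable proxy.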
uv run synthladder-materialize \
--output benchmark/benchmark_v3_eval.jsonl

uv run synthladder-student \
--benchmark-path benchmark/benchmark_v3_eval.jsonl \
--output-path runs/repro/student_output.jsonl \
--api-base-url "$API_BASE_URL" \
--model-name "gpt-5.4-mini" \
--api-key "$CLIPROXY_API_KEY" \
--agent-sandbox openshell \
--agent-backend openshell_worker \
--agent-tools-profile tools \
--student-guard-enabled true \
--student-guard-mode on_failure \
--student-guard-retries 2

uv run synthladder-judge \
--input-path runs/repro/student_output.jsonl \
--gold-path benchmark/benchmark_v3_eval.jsonl \
--judge-model "gpt-5.4-mini" \
--judge-model-url "$API_BASE_URL" \
--judge-api-key "$CLIPROXY_API_KEY" \
--judge-method g_eval \
--judge-g-eval-fallback-structured true \
--judge-structured-retries 2 \
--output-path runs/repro/llm_judge_output.jsonl

uv run synthladder-matrix \
--benchmark-path benchmark/benchmark_v3_eval.jsonl \
--api-base-url "$API_BASE_URL" \
--api-key "$CLIPROXY_API_KEY" \
--models gpt-5.4-mini \
--judge-model gpt-5.4-mini \
--agent-sandbox openshell \
--agent-backend openshell_worker \
--agent-tools-profile tools \
--student-guard-enabled true \
--student-guard-mode on_failure \
--student-guard-retries 2 \
--judge-method g_eval \
--judge-g-eval-fallback-structured true \
--judge-structured-retries 2
Outputs are written into runs/<run-id>/.
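The guard flags in the commands above (`--student-guard-mode on_failure`, `--student-guard-retries 2`) are not explained in this section. One plausible reading is a validate-then-retry loop: run the call, validate the result, and re-run only when validation fails, up to the retry limit. The sketch below illustrates that general pattern under this assumption; `call_with_guard` is a hypothetical helper and not the implementation in synthladder/guards/.

```python
def call_with_guard(call, validate, retries=2):
    """Run `call`, re-running it while `validate` rejects the result.

    `validate` signals rejection by raising ValueError.
    Returns (result, attempts). Raises ValueError if every attempt is rejected.
    """
    last_error = None
    # One initial attempt plus up to `retries` further attempts.
    for attempt in range(1, retries + 2):
        result = call()
        try:
            validate(result)
        except ValueError as exc:
            last_error = exc
            continue
        return result, attempt
    raise ValueError(f"validation failed after {retries + 1} attempts: {last_error}")
```

With `retries=2` the call runs at most three times, and a result that passes validation on the first attempt is returned without any retry.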
The repository does not redistribute raw external datasets. To rebuild the larger benchmark pools:
- download the public sources listed in external_sources/README.md
- place them into the expected external_sources/<group>/<source>/raw|extracted layout
- run:
uv run synthladder-build-levels
uv run synthladder-build-paper-eval
uv run synthladder-materialize \
--output benchmark/benchmark_v3_eval.jsonl
Build modules expose argparse CLIs, support --help, and fail fast on missing
source files or benchmark pools.
Restricted/commercial sources are documented in external_sources/blocked/ but
are not redistributed.
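Before running the build commands, it can help to confirm that downloaded sources sit in the expected external_sources/<group>/<source>/raw|extracted layout. The sketch below is a hypothetical pre-flight check, not part of the build modules. It reads `raw|extracted` as "either subdirectory is acceptable", which is an assumption, and it would also report any group directory that holds documentation only.

```python
from pathlib import Path


def find_incomplete_sources(root):
    """Return source directories that have neither a raw/ nor an extracted/ subdirectory."""
    incomplete = []
    for group in sorted(Path(root).iterdir()):
        if not group.is_dir():
            continue  # skip files such as README.md at the top level
        for source in sorted(group.iterdir()):
            if not source.is_dir():
                continue
            has_data = any((source / name).is_dir() for name in ("raw", "extracted"))
            if not has_data:
                incomplete.append(source)
    return incomplete


# Example call from the repository root:
# for path in find_incomplete_sources("external_sources"):
#     print(f"missing raw/ or extracted/: {path}")
```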
- The default integrity-oriented runtime is OpenShell. The local sandbox mode is a development fallback, not the preferred evaluation mode.
- The student path includes a hidden SGR reasoning phase for supported runs.
- The judge combines deterministic scoring with rubric-based LLM evaluation.
- score_method in llm_judge_output.jsonl records whether a row was scored by:
  - deterministic
  - g_eval
  - structured_fallback
  - llm_judge
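To see how a run's rows split across these scoring paths, the judge output can be tallied by `score_method`. This is a minimal sketch: it assumes each line of llm_judge_output.jsonl is a JSON object with `score_method` as a top-level field, and `tally_score_methods` is an illustrative name rather than a package function.

```python
import json
from collections import Counter
from pathlib import Path


def tally_score_methods(path):
    """Count rows per score_method value; rows without the field are counted as '<missing>'."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                row = json.loads(line)
                counts[row.get("score_method", "<missing>")] += 1
    return counts


# Example call, using the output path from the judge command:
# for method, n in tally_score_methods("runs/repro/llm_judge_output.jsonl").most_common():
#     print(f"{method}: {n}")
```

A large `structured_fallback` share may be worth a closer look, since it suggests the g_eval path often did not produce a usable score.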
- docs/architecture.md - runtime and package structure
- docs/benchmark_ladder.md - benchmark semantics
- docs/benchmark_construction.md - source-to-benchmark construction
- docs/benchmark_taxonomy.md - taxonomy and reporting
- docs/g_eval.md - judge behavior
- docs/tools/README.md - operational commands
- docs/security_runbook.md - runtime and sandbox policy