SynthLadder

SynthLadder is a chemistry benchmark and evaluation package for agentic language models on synthesis-focused tasks.

It is designed to measure performance at increasing planning depth, rather than to serve as a single mixed chemistry test set.

What The Benchmark Measures

benchmark_v3 is organized as a three-level ladder:

  • Level A: local reaction understanding
  • Level B: single-step retrosynthesis
  • Level C: route-level synthesis planning

The repository also includes an auxiliary grounding suite used for chemistry role and condition extraction, but the primary paper-facing benchmark is the A/B/C ladder.

The benchmark covers:

  • organic chemistry reasoning
  • retrosynthesis and route planning
  • deterministic and rubric-based LLM evaluation

Repository Layout

  • benchmark/
    • public benchmark artifacts, eval subsets, and manifests
  • synthladder/agents/
    • OpenShell runtime, tool registry, and SGR reasoning layer
  • synthladder/evaluation/
    • student stage, judge stage, and matrix runner
  • synthladder/guards/
    • typed validation and retry helpers for student and judge calls
  • synthladder/build/
    • reproducible benchmark construction utilities
  • external_sources/
    • source manifests, download links, and rebuild instructions
  • docs/
    • architecture, ladder semantics, judging, taxonomy, and runtime policy

The public repository intentionally excludes generated runs, scratch analysis, and legacy parsing pipelines that are not part of the final benchmark release.

Public Benchmark Files

Primary runtime entrypoint:

  • benchmark/benchmark_v3_eval.jsonl

Per-level public eval subsets:

  • benchmark/level_a_eval.jsonl
  • benchmark/level_b_eval.jsonl
  • benchmark/level_c_eval.jsonl

Supporting manifests:

  • benchmark/levels_manifest.yaml
  • benchmark/paper_eval_manifest.yaml
  • benchmark/LEVELS.md

Legacy compatibility/source layer retained for selected construction steps:

  • benchmark/benchmark_v1_0.jsonl
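
The JSONL files listed above can be inspected without importing the package. The following is a minimal sketch that counts records per level, assuming only that each non-empty line is one JSON object; the helper names `read_jsonl` and `count_records` are illustrative and not part of synthladder, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(paths):
    """Return {file name: record count}, skipping files that do not exist."""
    return {
        Path(p).name: sum(1 for _ in read_jsonl(p))
        for p in paths
        if Path(p).is_file()
    }


if __name__ == "__main__":
    # Paths are the per-level eval subsets listed in this README.
    level_files = [
        "benchmark/level_a_eval.jsonl",
        "benchmark/level_b_eval.jsonl",
        "benchmark/level_c_eval.jsonl",
    ]
    for name, n in count_records(level_files).items():
        print(f"{name}: {n} records")
```

Run it from the repository root so the relative paths resolve.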

Installation

Requirements:

  • Python 3.12+
  • uv
  • Docker for OpenShell-backed runs
  • an OpenAI-compatible endpoint or local proxy

Setup:

git clone https://github.com/SergeiNikolenko/SynthLadder.git
cd SynthLadder
uv sync

The package is importable as synthladder.

CLI Entry Points

Installed package entry points:

  • synthladder-student
  • synthladder-judge
  • synthladder-matrix
  • synthladder-materialize
  • synthladder-build
  • synthladder-build-levels
  • synthladder-build-paper-eval

Module entry points:

  • python -m synthladder.evaluation.student_validation
  • python -m synthladder.evaluation.llm_judge
  • python -m synthladder.evaluation.run_full_matrix
  • python -m synthladder.evaluation.materialize_benchmark_v3_eval

Runtime Setup

Export your model endpoint and API key:

export API_BASE_URL="http://127.0.0.1:8317/v1"
export CLIPROXY_API_KEY="<your-api-key>"

Quick health check:

curl -sS \
  -H "Authorization: Bearer $CLIPROXY_API_KEY" \
  "$API_BASE_URL/models"
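
The same health check can be scripted in Python using only the standard library. This is a sketch, not part of synthladder: the function names are illustrative, and the shape of the JSON returned by the endpoint is not assumed beyond it being valid JSON.

```python
import json
import urllib.request


def build_models_request(base_url, api_key):
    """Build a GET request for <base_url>/models with a Bearer token."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def list_models(base_url, api_key, timeout=10.0):
    """Send the request and return the decoded JSON body.

    urlopen raises urllib.error.URLError (or HTTPError) when the
    endpoint is unreachable or rejects the key.
    """
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `list_models(os.environ["API_BASE_URL"], os.environ["CLIPROXY_API_KEY"])` performs the same request as the curl command, after `import os`.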

If you run OpenShell-backed evaluation, make sure Docker and the OpenShell gateway are available.

Reproducing Benchmark Execution

1. Materialize the runtime-facing benchmark file

uv run synthladder-materialize \
  --output benchmark/benchmark_v3_eval.jsonl

2. Run student inference

uv run synthladder-student \
  --benchmark-path benchmark/benchmark_v3_eval.jsonl \
  --output-path runs/repro/student_output.jsonl \
  --api-base-url "$API_BASE_URL" \
  --model-name "gpt-5.4-mini" \
  --api-key "$CLIPROXY_API_KEY" \
  --agent-sandbox openshell \
  --agent-backend openshell_worker \
  --agent-tools-profile tools \
  --student-guard-enabled true \
  --student-guard-mode on_failure \
  --student-guard-retries 2

3. Run the judge

uv run synthladder-judge \
  --input-path runs/repro/student_output.jsonl \
  --gold-path benchmark/benchmark_v3_eval.jsonl \
  --judge-model "gpt-5.4-mini" \
  --judge-model-url "$API_BASE_URL" \
  --judge-api-key "$CLIPROXY_API_KEY" \
  --judge-method g_eval \
  --judge-g-eval-fallback-structured true \
  --judge-structured-retries 2 \
  --output-path runs/repro/llm_judge_output.jsonl
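
When scripting steps 2 and 3 with custom paths, the two invocations can be chained from Python. The sketch below only assembles the flags shown above into argument lists; `student_command`, `judge_command`, and `run_student_then_judge` are hypothetical helpers, not functions provided by synthladder.

```python
import subprocess

BENCHMARK = "benchmark/benchmark_v3_eval.jsonl"


def student_command(output_path, model, base_url, api_key):
    """Argument list mirroring the step 2 student invocation."""
    return [
        "uv", "run", "synthladder-student",
        "--benchmark-path", BENCHMARK,
        "--output-path", output_path,
        "--api-base-url", base_url,
        "--model-name", model,
        "--api-key", api_key,
        "--agent-sandbox", "openshell",
        "--agent-backend", "openshell_worker",
        "--agent-tools-profile", "tools",
        "--student-guard-enabled", "true",
        "--student-guard-mode", "on_failure",
        "--student-guard-retries", "2",
    ]


def judge_command(input_path, output_path, model, base_url, api_key):
    """Argument list mirroring the step 3 judge invocation."""
    return [
        "uv", "run", "synthladder-judge",
        "--input-path", input_path,
        "--gold-path", BENCHMARK,
        "--judge-model", model,
        "--judge-model-url", base_url,
        "--judge-api-key", api_key,
        "--judge-method", "g_eval",
        "--judge-g-eval-fallback-structured", "true",
        "--judge-structured-retries", "2",
        "--output-path", output_path,
    ]


def run_student_then_judge(student_out, judge_out, model, base_url, api_key):
    """Run both stages in order; check=True aborts if either stage fails,
    so the judge never runs on a failed student stage."""
    subprocess.run(
        student_command(student_out, model, base_url, api_key), check=True
    )
    subprocess.run(
        judge_command(student_out, judge_out, model, base_url, api_key),
        check=True,
    )
```

Passing arguments as a list avoids shell quoting issues with the API key and URL.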

4. Run the full matrix pipeline

uv run synthladder-matrix \
  --benchmark-path benchmark/benchmark_v3_eval.jsonl \
  --api-base-url "$API_BASE_URL" \
  --api-key "$CLIPROXY_API_KEY" \
  --models gpt-5.4-mini \
  --judge-model gpt-5.4-mini \
  --agent-sandbox openshell \
  --agent-backend openshell_worker \
  --agent-tools-profile tools \
  --student-guard-enabled true \
  --student-guard-mode on_failure \
  --student-guard-retries 2 \
  --judge-method g_eval \
  --judge-g-eval-fallback-structured true \
  --judge-structured-retries 2

Outputs are written into runs/<run-id>/.

Rebuilding The Benchmark From Sources

The repository does not redistribute raw external datasets. To rebuild the larger benchmark pools:

  1. download the public sources listed in external_sources/README.md
  2. place them into the expected external_sources/<group>/<source>/raw|extracted layout
  3. run:
uv run synthladder-build-levels
uv run synthladder-build-paper-eval
uv run synthladder-materialize \
  --output benchmark/benchmark_v3_eval.jsonl

Build modules expose argparse CLIs, support --help, and fail fast on missing source files or benchmark pools.

Restricted/commercial sources are documented in external_sources/blocked/ but are not redistributed.
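
Before running the build commands, it can be useful to check that downloaded sources sit in the layout described in step 2. The sketch below reads `raw|extracted` as "each source directory contains a raw/ or an extracted/ subdirectory", which is an assumption; skipping the `blocked` group is also an assumption, and `sources_missing_data` is a hypothetical helper rather than part of the build tooling.

```python
from pathlib import Path


def sources_missing_data(root="external_sources", skip_groups=("blocked",)):
    """Return source directories that have neither raw/ nor extracted/.

    Assumes the layout <root>/<group>/<source>/{raw,extracted}; groups
    named in skip_groups are ignored (assumed to hold documentation only).
    """
    root_path = Path(root)
    missing = []
    if not root_path.is_dir():
        return missing
    for group in sorted(p for p in root_path.iterdir() if p.is_dir()):
        if group.name in skip_groups:
            continue
        for source in sorted(p for p in group.iterdir() if p.is_dir()):
            has_data = (source / "raw").is_dir() or (source / "extracted").is_dir()
            if not has_data:
                missing.append(source)
    return missing


if __name__ == "__main__":
    for source in sources_missing_data():
        print(f"no raw/ or extracted/ directory in {source}")
```

The build CLIs already fail fast on missing inputs, so this is only a convenience for spotting misplaced downloads earlier.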

Evaluation Notes

  • The default integrity-oriented runtime is OpenShell.
  • The local sandbox mode is a development fallback, not the preferred evaluation mode.
  • The student path includes a hidden SGR reasoning phase for supported runs.
  • The judge combines deterministic scoring with rubric-based LLM evaluation.
  • score_method in llm_judge_output.jsonl records which method scored each row:
    • deterministic
    • g_eval
    • structured_fallback
    • llm_judge
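
To see how rows were distributed across the scoring methods listed above, the judge output can be tallied with a few lines of Python. This sketch assumes `score_method` is a top-level key in each JSONL row, which the README does not state explicitly; `score_method_counts` is an illustrative helper name.

```python
import json
from collections import Counter
from pathlib import Path


def score_method_counts(path):
    """Count rows per score_method value in a judge output JSONL file.

    Rows without the key are counted under "<missing>" (assumption:
    score_method is a top-level field of each row).
    """
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                row = json.loads(line)
                counts[row.get("score_method", "<missing>")] += 1
    return counts


if __name__ == "__main__":
    output = Path("runs/repro/llm_judge_output.jsonl")
    if output.is_file():
        for method, n in score_method_counts(output).most_common():
            print(f"{method}: {n}")
```

A large share of `structured_fallback` rows would indicate that the primary g_eval path is failing often for the chosen judge model.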
