This directory contains the benchmark test suites and the tooling for publishing them to the Hugging Face Hub.
```
datasets/
├── suites/                 # Source of truth (human-editable JSON)
│   └── wp-core-v1/
│       ├── execution/      # Code generation tests (one file per category)
│       │   ├── hooks.json
│       │   ├── rest-api.json
│       │   └── ...
│       └── knowledge/      # Multiple choice / short answer tests
│           ├── hooks.json
│           ├── rest-api.json
│           └── ...
├── data/                   # Generated Parquet for HF (gitignored)
│   └── test.parquet
├── export_dataset.py       # Converts suites → Parquet
└── README.md
```
The harness loads tests directly from the `suites/` JSON files:

```yaml
# wp-bench.yaml
dataset:
  source: local
  name: wp-core-v1
```
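To illustrate how a loader can walk this layout, here is a minimal sketch. `load_suite` is a hypothetical helper, not the harness's actual API; it only assumes the directory structure shown above, where each category file holds a JSON array of tests.

```python
import json
from pathlib import Path


def load_suite(suite_dir):
    """Collect all tests from a suite's execution/ and knowledge/ category files."""
    tests = {}
    for section in ("execution", "knowledge"):
        tests[section] = []
        # Each category file (hooks.json, rest-api.json, ...) is a JSON array of tests.
        for path in sorted((Path(suite_dir) / section).glob("*.json")):
            tests[section].extend(json.loads(path.read_text()))
    return tests
```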
Export to Parquet:

```bash
python datasets/export_dataset.py
```
Upload to the HF Hub:

```bash
huggingface-cli upload WordPress/wp-bench-v1 datasets/data/
```
Users can then load the dataset:

```python
from datasets import load_dataset

ds = load_dataset("WordPress/wp-bench-v1", split="test")
```
To add a new suite:

- Create `suites/<suite-name>/execution/` and `knowledge/` directories
- Add category JSON files (e.g., `hooks.json`, `rest-api.json`) to each directory
- Follow the schema in the existing suites
- Run `python datasets/export_dataset.py` to include the suite in the Parquet export
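When following the schema, a quick consistency check can catch missing fields before export. The sketch below is a hypothetical helper (not part of `export_dataset.py`); the required-field sets are taken from the schema tables in this README.

```python
import json
from pathlib import Path

# Required fields per test type, per the schema tables below.
REQUIRED_FIELDS = {
    "execution": {"id", "prompt", "requirements", "static_checks",
                  "runtime_checks", "reference_solution"},
    "knowledge": {"id", "prompt", "choices", "correct_answer"},
}


def check_category_file(path, test_type):
    """Return (test_id, missing_fields) pairs for every incomplete test in one file."""
    problems = []
    for test in json.loads(Path(path).read_text()):
        missing = REQUIRED_FIELDS[test_type] - test.keys()
        if missing:
            problems.append((test.get("id"), sorted(missing)))
    return problems
```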
Execution tests use the following fields:

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique test ID |
| `prompt` | string | Task description for the model |
| `requirements` | array | List of requirements the solution must meet |
| `static_checks` | object | Regex patterns to check in generated code |
| `runtime_checks` | object | Assertions to run in a WordPress environment |
| `reference_solution` | string | Example correct solution |
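For illustration, a test record with these fields might look like the dictionary below. The IDs, prompt, and the inner shape of `static_checks` and `runtime_checks` are invented for this sketch; only the top-level field names come from the schema above.

```python
# Hypothetical execution test entry; values are illustrative, not from a real suite.
example_test = {
    "id": "exec-hooks-001",
    "prompt": "Write a plugin snippet that logs a message when a post is published.",
    "requirements": [
        "Hook into the publish_post action",
        "Use error_log() for output",
    ],
    # Assumed shape: a map of check names to regex patterns (see schema table).
    "static_checks": {
        "uses_add_action": "add_action\\(",
    },
    # Assumed shape: a map of check names to assertions run against WordPress.
    "runtime_checks": {
        "logs_on_publish": "publishing a post writes one log entry",
    },
    "reference_solution": "add_action('publish_post', function ($post_id) { error_log('Published ' . $post_id); });",
}

# Sanity checks against the documented field types.
assert isinstance(example_test["requirements"], list)
assert isinstance(example_test["static_checks"], dict)
```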
Knowledge tests use the following fields:

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique test ID |
| `prompt` | string | Question text |
| `choices` | array | Multiple choice options `[{key, text}]` |
| `correct_answer` | string | Correct choice key (e.g., `"B"`) |
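A knowledge test record following this schema might look like the sketch below; the ID, question, and choices are invented for illustration, while the field names and the `[{key, text}]` choice shape come from the table above.

```python
# Hypothetical knowledge test entry; values are illustrative, not from a real suite.
example_question = {
    "id": "know-hooks-001",
    "prompt": "Which function attaches a callback to a WordPress action hook?",
    "choices": [
        {"key": "A", "text": "add_filter()"},
        {"key": "B", "text": "add_action()"},
        {"key": "C", "text": "do_action()"},
        {"key": "D", "text": "apply_filters()"},
    ],
    "correct_answer": "B",
}

# The correct answer must be one of the choice keys.
assert example_question["correct_answer"] in {c["key"] for c in example_question["choices"]}
```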