🌐 Homepage | 🤗 Datasets | 📖 Paper
This repository contains the evaluation code for the paper PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation.
Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment.
In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks.
Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
- Instance-Specific, Fine-Grained Criteria
Existing evaluation frameworks often adopt instance-agnostic scoring schemes, typically relying on a judging paradigm that poses the same set of general questions for all slide decks. Such evaluations fail to account for instance-specific content, making it difficult to assess whether a slide generation system truly follows the intended input.
PresentBench establishes fine-grained checklist items tailored to each slide deck instance. On average, each instance is associated with more than 50 specifically designed atomic evaluation items, converting vague qualitative grading into verifiable binary checks.
- Authentic, Grounded Scenarios
A large portion of prior work focuses on isolated subtasks or reference-free settings without grounding the task in concrete background materials. This creates a mismatch between evaluation settings and real-world usage.
For each instance in PresentBench, we curate authoritative background materials, such as top-tier conference papers, university course textbooks, and financial reports, and require systems to generate slides grounded in these materials. This design ensures that every task reflects realistic, end-to-end slide generation scenarios based on authentic sources.
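To make the idea of binary checklist items concrete, here is a minimal sketch of how yes/no verdicts for one instance could be summarized. The `ChecklistItem` schema, the example questions, and the unweighted pass rate are all assumptions made for illustration; this section does not describe how PresentBench actually aggregates its checklist results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One binary rubric question plus the judge's verdict (hypothetical schema)."""

    question: str
    passed: bool


def pass_rate(items: list[ChecklistItem]) -> float:
    """Return the fraction of items answered 'yes'.

    Unweighted averaging is an assumption, not the benchmark's documented metric.
    """
    if not items:
        raise ValueError("checklist is empty")
    return sum(item.passed for item in items) / len(items)


if __name__ == "__main__":
    # Made-up questions, for illustration only.
    demo = [
        ChecklistItem("Does the deck state the paper's main contribution?", True),
        ChecklistItem("Is the reported dataset size consistent with the source?", False),
        ChecklistItem("Does every figure have a caption?", True),
        ChecklistItem("Is there a concluding slide?", True),
    ]
    print(f"pass rate: {pass_rate(demo):.2f}")  # 3 of 4 items pass -> 0.75
```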
conda create -n presentbench python=3.11
conda activate presentbench
pip install -r requirements.txt

Download the PresentBench dataset from the Hugging Face dataset repo PresentBench/PresentBench into ./data:

python scripts/download_data.py

The script accepts three key arguments, all optional with sensible defaults:
- --repo_id: Which Hugging Face dataset repo to pull from, e.g. PresentBench/PresentBench.
- --revision: Which git commit or tag of that dataset repo to use (for reproducibility).
- --data_root: Local folder where the dataset will be downloaded to (default is ./data).
Your slide agent should take both the material and instructions as input to generate its outputs.
Before running evaluation, make sure your agent writes its outputs under a result root directory (<RESULT_ROOT>), such that the subtree mirrors data/:
<RESULT_ROOT>/<domain/.../case>/generation_task/results/
├── slides.pdf
└── (optionally) additional artifacts
<RESULT_ROOT> defaults to <repo_root>/results/<AgentName>. You can also use a custom path; pass it to judge_all.py via --result_root.
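Before launching evaluation, it can help to confirm that the result tree really mirrors data/. The sketch below is not part of the repository. It assumes that every case directory under data/ contains a generation_task/ folder, and the helper names `find_cases` and `missing_slides` are hypothetical.

```python
from pathlib import Path


def find_cases(data_root: Path) -> list[Path]:
    """Return case directories (parents of generation_task/), relative to data_root."""
    return sorted(
        task_dir.parent.relative_to(data_root)
        for task_dir in data_root.rglob("generation_task")
        if task_dir.is_dir()
    )


def missing_slides(data_root: Path, result_root: Path) -> list[Path]:
    """List cases whose expected slides.pdf is absent under result_root."""
    return [
        case
        for case in find_cases(data_root)
        if not (result_root / case / "generation_task" / "results" / "slides.pdf").is_file()
    ]


if __name__ == "__main__":
    # "MyAgent" is a placeholder agent name, not one defined by the benchmark.
    for case in missing_slides(Path("data"), Path("results") / "MyAgent"):
        print(f"missing slides.pdf for case: {case}")
```

An empty output means every discovered case has a slides.pdf in the expected location.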
Evaluation uses external APIs (Gemini/OpenAI). Create a .env file in the repo root with the following content (Gemini example):
GENAI_API_KEY=xxxxxxx
GENAI_BASE_URL=https://xxxxx.com/
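A quick way to catch a missing or empty key before spending time on a full run is to parse the .env file yourself. The sketch below uses only the standard library and handles plain KEY=VALUE lines. It is not how judge_all.py loads its configuration (that mechanism is not shown here), and treating both Gemini keys as required is an assumption.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_env_file(
    path: Path,
    required: tuple[str, ...] = ("GENAI_API_KEY", "GENAI_BASE_URL"),
) -> list[str]:
    """Return the required keys that are missing or empty in the given .env file."""
    values = parse_env_text(path.read_text(encoding="utf-8")) if path.is_file() else {}
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    missing = check_env_file(Path(".env"))
    print("missing keys:", ", ".join(missing) if missing else "none")
```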
Run the following command for full-benchmark evaluation:
python judge_all.py --agent_name <AgentName>

Key command-line arguments:
- --agent_name (required): Logical name of your agent. Used to construct the default result root at results/<AgentName>/. (If the --result_root argument is provided, this value will not be used.)
- --api_type (default: "gemini"): Which backend API to use when calling the judge (e.g., gemini or gemini_inline).
- --model (default: "gemini-3-flash-preview"): Specific model name for the chosen API type.
- --thinking_level (optional): Optional knob for models that expose different thinking or reasoning levels.
- --data_root (optional): Root directory of benchmark data. By default this is: <repo_root>/data
- --result_root (optional): Root directory where your agent's slide decks live. By default this is: ./results/<AgentName>/
- --max_workers (optional): Maximum number of worker processes for parallel evaluation. If omitted, it defaults to the number of CPU cores.
- --debug (optional): Run in single-process mode and pass --debug into judge.py. Useful for local debugging.
- --min_timestamp (optional, format YYYY-MM-DD_HH-MM-SS): Result files with timestamps older than this time will be skipped, which is helpful when resuming partial runs.
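The YYYY-MM-DD_HH-MM-SS format maps directly onto a `datetime.strptime` pattern, which makes the cutoff comparison easy to reproduce. The sketch below is an illustration rather than the logic in judge_all.py. It assumes the timestamp is taken from the file name, and the helper names `parse_timestamp` and `is_fresh` are hypothetical.

```python
from datetime import datetime
from pathlib import Path

# strptime pattern corresponding to YYYY-MM-DD_HH-MM-SS.
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def parse_timestamp(text: str) -> datetime:
    """Convert a YYYY-MM-DD_HH-MM-SS string into a datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def is_fresh(result_file: Path, min_timestamp: str) -> bool:
    """True if the timestamp in the file name is not older than min_timestamp."""
    return parse_timestamp(result_file.stem) >= parse_timestamp(min_timestamp)


if __name__ == "__main__":
    # File names are made up for illustration.
    files = [Path("2026-01-05_09-30-00.log"), Path("2026-01-07_18-00-00.log")]
    kept = [f.name for f in files if is_fresh(f, "2026-01-06_00-00-00")]
    print(kept)  # ['2026-01-07_18-00-00.log']
```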
You can also use the wrapper script judge_all.sh (run it with bash judge_all.sh).
For each case, judge_all.py writes evaluation logs next to the corresponding slides. Logs are stored under:
<RESULT_ROOT>/<domain/.../case>/generation_task/results/log/YYYY-MM-DD_HH-MM-SS.log
The main components of the repository are as follows:
- data/ – benchmark test cases (download from Hugging Face).
  Domains under data/ include (non-exhaustive):
  - academia/
  - advertising/
  - economics/
  - education/
  - talk/
  Each leaf case typically looks like:
  - material.pdf | material.md | material_N.md | material_N.pdf – source documents (PDFs, text, etc.).
  - generation_task/ – generation instructions and evaluation configuration:
    - instructions.md
    - judge_prompt.json
- results/ (not tracked by git) – where your agent's generated slides are expected to live. The directory structure of each agent's results needs to match the dataset layout. A typical layout is:
  results/<AgentName>/<domain/.../case>/generation_task/results
  ├── slides.pdf
  └── (optionally) additional artifacts
- judge_all.py – main script to run evaluation on all benchmark cases.
- judge_all.sh – convenience shell wrapper around judge_all.py.
- judge.py – entry point for scoring a single case.
- utils/ – helpers used by the judging / statistics pipeline.
BibTeX:
@article{chen2026presentbench,
title={PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation},
author={Chen, Xin-Sheng and Zhu, Jiayu and Li, Pei-lin and Wang, Hanzheng and Yang, Shuojin and Guo, Meng-Hao},
journal={arXiv preprint arXiv:2603.07244},
year={2026}
}