Raw conversation transcripts and judge outputs for *Same Question, Different
Answer: Latent Quality in LLMs Under Critical Engagement*. Everything the
paper's analysis scripts (`paper/figures/*.py`, `paper/analysis/*.py`) read
lives here.

| Path | What it is |
|---|---|
| `phase2/` | Primary study: 14 open-ended + 2 objective tasks, 9 engagement conditions, DeepSeek V4 Pro focal model. Includes per-condition closing rubric scores and pairwise judge verdicts. |
| `replication_grok/` | Replication study: 6-task subset, Grok 4.3 focal model, user-LLM and judge held constant from the primary study. |
| `tier1/` | Comprehensive-critique kept prompt: four tier-1 tasks (A2, A5, C1, C2), three runs each. The production prompt used in `phase2/critical_L3/` (mirrored here for the prompt-ablation comparison in Appendix B). |
| `tier1_all/` | Comprehensive-critique persistence-directive variant: same four tasks and runs, ablation prompt with three added persistence directives. Used in Appendix B. |
| `tier1_bookend/` | Non-think reasoning-mode bookend: four tier-1 tasks under three primary conditions, focal model run without extended-reasoning enabled. Used in Appendix D. |
| `pilot/` | Pre-study sample used to validate the conversation runner. Not part of any reported result. |
| `art_study/` | Taxonomy pilot conversations and per-pair pairwise judge outputs (4 overlap topics; human researcher engaging Claude Opus 4.7). Used in Appendix A. |
| `manipulation_check.json` | Aggregated manipulation-check statistics across primary and replication studies (Appendix C). |
| `tier1_bookend_*.log` | Run logs for the non-think bookend experiments. Diagnostic only. |

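
As a quick sanity check on the layout above, the following sketch counts conversation files per top-level directory. It assumes conversation files follow the `<task>_run<N>.json` naming used in this README and may sit in nested condition subdirectories (as `phase2/critical_L3/` suggests); the function name `count_runs` and the `data_root` argument are hypothetical and not part of the repository's scripts.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed naming convention: <task>_run<N>.json, e.g. A2_run1.json
RUN_FILE = re.compile(r"^(?P<task>.+)_run(?P<run>\d+)\.json$")


def count_runs(data_root):
    """Count run files under each top-level entry of data_root.

    Files that do not match the run-file pattern (for example
    manipulation_check.json or the .log files) are ignored.
    """
    root = Path(data_root)
    counts = Counter()
    for path in root.rglob("*.json"):
        if RUN_FILE.match(path.name):
            # First path component relative to the root, e.g. "phase2"
            top_level = path.relative_to(root).parts[0]
            counts[top_level] += 1
    return dict(counts)


if __name__ == "__main__":
    # Run from this data directory; prints one line per study directory.
    for name, n in sorted(count_runs(".").items()):
        print(f"{name}: {n} run files")
```

Comparing these counts against the task and run numbers in the table is a cheap way to spot an incomplete download before running the analysis scripts.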
Each `<task>_run<N>.json` contains:

```json
{
  "task_id": "A2",
  "condition": "critical_L3",
  "user_model": "gemini-3.1-pro",  // stale label, see note below
  "focal_model": "deepseek/deepseek-chat",
  "judge_model": "gpt-5.4",
  "turns": [
    { "role": "user", "content": "..." },
    { "role": "focal", "content": "..." },
    ...
  ],
  "closing_response": "...",
  "rubric_scores": { ... }
}
```

The `user_model` field in `phase2/config.yaml` and the per-conversation JSONs
reads `gemini-3.1-pro`. This is a stale YAML key, not a true reflection of
the runtime: the production user-LLM was Claude Sonnet 4.6, switched in
phase 1.5 of the project (see `docs/phases/phase-1.5-retro.md`). The config
key was never updated. Paper §4 (Models) is authoritative on which models
were used.

Every figure and reported number regenerates from `paper/figures/` (figures)
and `paper/analysis/` (prose numbers), reading from this directory. Run from
the repo root:

```bash
uv sync
uv run python paper/figures/fig2_dimension_heatmap.py  # example
```
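
Because the recorded `user_model` label is stale, downstream code that reads the conversation files directly may want to correct it at load time. The sketch below is one hedged way to do that: `load_conversation` and the `user_model_recorded` key are hypothetical names, not part of the paper's scripts, and the replacement label is taken from the note in this README (Paper §4 remains authoritative). It assumes the files on disk are plain JSON without inline comments.

```python
import json
from pathlib import Path

# Assumption: values taken from the stale-label note in this README.
STALE_USER_MODEL = "gemini-3.1-pro"
ACTUAL_USER_MODEL = "Claude Sonnet 4.6"


def load_conversation(path):
    """Load one <task>_run<N>.json file and correct the stale user-LLM label.

    The original label is preserved under the hypothetical key
    "user_model_recorded" so the correction stays auditable.
    """
    record = json.loads(Path(path).read_text(encoding="utf-8"))
    if record.get("user_model") == STALE_USER_MODEL:
        record["user_model_recorded"] = record["user_model"]
        record["user_model"] = ACTUAL_USER_MODEL
    return record
```

Keeping the recorded value alongside the corrected one means the files themselves never need to be rewritten, so the raw data stays byte-identical to what the study produced.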