Experiment data

Raw conversation transcripts and judge outputs for "Same Question, Different Answer: Latent Quality in LLMs Under Critical Engagement". Everything the paper's analysis scripts (`paper/figures/*.py`, `paper/analysis/*.py`) read lives here.

Directory layout

Path	What it is
`phase2/`	Primary study: 14 open-ended + 2 objective tasks, 9 engagement conditions, DeepSeek V4 Pro focal model. Includes per-condition closing rubric scores and pairwise judge verdicts.
`replication_grok/`	Replication study: 6-task subset, Grok 4.3 focal model, user-LLM and judge held constant from the primary study.
`tier1/`	Comprehensive-critique kept prompt: four tier-1 tasks (A2, A5, C1, C2), three runs each. This is the production prompt used in `phase2/critical_L3/`, mirrored here for the prompt-ablation comparison in Appendix B.
`tier1_all/`	Comprehensive-critique persistence-directive variant: same four tasks and runs, ablation prompt with three added persistence directives. Used in Appendix B.
`tier1_bookend/`	Non-think reasoning-mode bookend: four tier-1 tasks under three primary conditions, focal model run with extended reasoning disabled. Used in Appendix D.
`pilot/`	Pre-study sample used to validate the conversation runner. Not part of any reported result.
`art_study/`	Taxonomy pilot conversations and per-pair pairwise judge outputs (4 overlap topics; human researcher engaging Claude Opus 4.7). Used in Appendix A.
`manipulation_check.json`	Aggregated manipulation-check statistics across primary and replication studies (Appendix C).
`tier1_bookend_*.log`	Run logs for the non-think bookend experiments. Diagnostic only.
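
To work with one of the study directories above, it can help to enumerate its run files first. The following is a minimal sketch, assuming the conversation files follow the `<task>_run<N>.json` naming described under "Conversation file format" and may sit in condition subdirectories such as `phase2/critical_L3/`. The function name `find_runs` and the recursive search are illustrative assumptions, not part of the repository's own scripts.

```python
import re
from pathlib import Path

# Matches names like "A2_run3.json"; the pattern is inferred from the
# <task>_run<N>.json convention and is an assumption about real filenames.
RUN_FILE = re.compile(r"^(?P<task>.+)_run(?P<run>\d+)\.json$")


def find_runs(study_dir):
    """Yield (task_id, run_number, path) for each run file under study_dir."""
    for path in sorted(Path(study_dir).rglob("*_run*.json")):
        match = RUN_FILE.match(path.name)
        if match:
            yield match["task"], int(match["run"]), path
```

For example, `list(find_runs("phase2"))` would list every run file found anywhere below `phase2/`, provided the naming assumption holds.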

Conversation file format

Each `<task>_run<N>.json` file contains:

```json
{
  "task_id": "A2",
  "condition": "critical_L3",
  "user_model": "gemini-3.1-pro",   // stale label, see note below
  "focal_model": "deepseek/deepseek-chat",
  "judge_model": "gpt-5.4",
  "turns": [
    { "role": "user",  "content": "..." },
    { "role": "focal", "content": "..." },
    ...
  ],
  "closing_response": "...",
  "rubric_scores": { ... }
}
```
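
As an illustration of reading this format, here is a minimal sketch that loads one conversation file and separates its turns by role. It assumes the files on disk are plain JSON (without the `//` annotation shown above, which is not valid JSON) and that every turn carries `role` and `content` keys. The helper names are hypothetical and do not come from the paper's scripts.

```python
import json
from collections import Counter
from pathlib import Path


def load_conversation(path):
    """Read one <task>_run<N>.json file and return it as a dict."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def turn_counts(conversation):
    """Count turns per role (for example "user" and "focal")."""
    return Counter(turn["role"] for turn in conversation["turns"])


def focal_replies(conversation):
    """Return the focal model's replies in conversation order."""
    return [t["content"] for t in conversation["turns"] if t["role"] == "focal"]
```

A typical use would be `turn_counts(load_conversation(path))` to check that a run has the expected number of exchanges before reading its `rubric_scores`.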

Known stale label: `user_model: gemini-3.1-pro`

The `user_model` field in `phase2/config.yaml` and in the per-conversation JSONs reads `gemini-3.1-pro`. This is a stale config value, not a true reflection of the runtime: the production user-LLM was Claude Sonnet 4.6, switched in phase 1.5 of the project (see `docs/phases/phase-1.5-retro.md`). The value in the config was never updated, so it was copied into every conversation file. Paper §4 (Models) is authoritative on which models were used.
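
If downstream code reports the user-LLM, one option is to correct the label at load time rather than editing the data files. The sketch below is an assumption about how that could be done, not something the repository provides. The replacement string `"Claude Sonnet 4.6"` is the model name given in this note, not an official API identifier.

```python
STALE_USER_MODEL = "gemini-3.1-pro"
# Human-readable name taken from the note above; the exact identifier
# used at runtime is not recorded here, so this string is an assumption.
ACTUAL_USER_MODEL = "Claude Sonnet 4.6"


def with_corrected_user_model(conversation):
    """Return a shallow copy with the stale user_model label replaced."""
    fixed = dict(conversation)
    if fixed.get("user_model") == STALE_USER_MODEL:
        fixed["user_model"] = ACTUAL_USER_MODEL
    return fixed
```

Returning a copy leaves the loaded record untouched, so the raw data can still be compared against the files on disk.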

Reproducing the paper's numbers

Every figure and reported number can be regenerated by the scripts in `paper/figures/` (figures) and `paper/analysis/` (prose numbers), which read from this directory. Run from the repo root:

```shell
uv sync
uv run python paper/figures/fig2_dimension_heatmap.py   # example
```
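
To regenerate all figures in one pass, the single-script command above can be wrapped in a small driver. This is a sketch under the assumption that every `.py` file directly inside `paper/figures/` is a standalone script; if that directory also holds shared helper modules, they would need to be filtered out. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def figure_commands(figures_dir="paper/figures"):
    """Build one `uv run python <script>` command per script in figures_dir."""
    scripts = sorted(Path(figures_dir).glob("*.py"))
    return [["uv", "run", "python", str(script)] for script in scripts]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(figure_commands())` from the repo root would then run each script in alphabetical order.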
