Experiment data

Raw conversation transcripts and judge outputs for Same Question, Different Answer: Latent Quality in LLMs Under Critical Engagement. Everything that the paper's analysis scripts (paper/figures/*.py, paper/analysis/*.py) read lives here.

Directory layout

| Path | What it is |
| --- | --- |
| `phase2/` | Primary study: 14 open-ended + 2 objective tasks, 9 engagement conditions, DeepSeek V4 Pro focal model. Includes per-condition closing rubric scores and pairwise judge verdicts. |
| `replication_grok/` | Replication study: 6-task subset, Grok 4.3 focal model, user-LLM and judge held constant from the primary study. |
| `tier1/` | Comprehensive-critique kept prompt: four tier-1 tasks (A2, A5, C1, C2), three runs each. The production prompt used in `phase2/critical_L3/` (mirrored here for the prompt-ablation comparison in Appendix B). |
| `tier1_all/` | Comprehensive-critique persistence-directive variant: same four tasks and runs, ablation prompt with three added persistence directives. Used in Appendix B. |
| `tier1_bookend/` | Non-think reasoning-mode bookend: four tier-1 tasks under three primary conditions, focal model run without extended reasoning enabled. Used in Appendix D. |
| `pilot/` | Pre-study sample used to validate the conversation runner. Not part of any reported result. |
| `art_study/` | Taxonomy pilot conversations and per-pair pairwise judge outputs (4 overlap topics; human researcher engaging Claude Opus 4.7). Used in Appendix A. |
| `manipulation_check.json` | Aggregated manipulation-check statistics across primary and replication studies (Appendix C). |
| `tier1_bookend_*.log` | Run logs for the non-think bookend experiments. Diagnostic only. |

Conversation file format

Each <task>_run<N>.json contains:

{
  "task_id": "A2",
  "condition": "critical_L3",
  "user_model": "gemini-3.1-pro",   // stale label, see note below
  "focal_model": "deepseek/deepseek-chat",
  "judge_model": "gpt-5.4",
  "turns": [
    { "role": "user",  "content": "..." },
    { "role": "focal", "content": "..." },
    ...
  ],
  "closing_response": "...",
  "rubric_scores": { ... }
}
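
As a rough illustration of the format above, the following Python sketch loads one conversation file and pulls out the focal model's turns. It assumes the files on disk are strict JSON (the `//` comment in the schema above is read as annotation only) and that file names follow the `<task>_run<N>.json` pattern. The helper names `load_conversation` and `focal_turns` are hypothetical and are not part of the repository's scripts.

```python
import json
import re
from pathlib import Path

# Filename pattern "<task>_run<N>.json", taken from the format description.
RUN_FILE = re.compile(r"^(?P<task>.+)_run(?P<run>\d+)\.json$")


def load_conversation(path):
    """Load one conversation file and return (task, run_number, record).

    Hypothetical helper: assumes the file is strict JSON with the fields
    shown in the schema (task_id, condition, turns, ...).
    """
    path = Path(path)
    match = RUN_FILE.match(path.name)
    if match is None:
        raise ValueError(f"not a conversation file: {path.name}")
    with path.open(encoding="utf-8") as fh:
        record = json.load(fh)
    return match["task"], int(match["run"]), record


def focal_turns(record):
    """Return the content of every turn whose role is "focal"."""
    return [
        turn["content"]
        for turn in record.get("turns", [])
        if turn.get("role") == "focal"
    ]
```

A call such as `load_conversation("phase2/critical_L3/A2_run1.json")` would then return the task label, the run number, and the parsed record; that path is only an example of the naming pattern and may not match a real file.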

Known stale label: user_model: gemini-3.1-pro

The user_model field in phase2/config.yaml and in the per-conversation JSONs reads gemini-3.1-pro. This value is stale and does not reflect the runtime: the production user-LLM was Claude Sonnet 4.6, switched in phase 1.5 of the project (see docs/phases/phase-1.5-retro.md). The config value was never updated. Paper §4 (Models) is authoritative on which models were used.
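
If you do your own analysis on these files and group by user model, one option is to correct the label at load time rather than editing the data. The sketch below is an assumption about how that could look, not something the paper's scripts are known to do. The constant and function names are hypothetical, and "Claude Sonnet 4.6" is the display name used in this README, not an API identifier.

```python
# Stale label as stored in the files, and the model named in this README
# as the actual production user-LLM. Both constants are illustrative.
STALE_USER_MODEL = "gemini-3.1-pro"
ACTUAL_USER_MODEL = "Claude Sonnet 4.6"  # display name, not an API id


def with_corrected_user_model(record):
    """Return a shallow copy of a conversation record with the stale
    user_model label replaced. The input record is left unmodified."""
    corrected = dict(record)
    if corrected.get("user_model") == STALE_USER_MODEL:
        corrected["user_model"] = ACTUAL_USER_MODEL
    return corrected
```

Returning a copy keeps the on-disk data and any cached records identical to what the paper's scripts read.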

Reproducing the paper's numbers

Every figure and reported number regenerates from paper/figures/ (figures) and paper/analysis/ (prose numbers), reading from this directory. Run from the repo root:

uv sync
uv run python paper/figures/fig2_dimension_heatmap.py   # example
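
To regenerate everything in one pass, a small driver along the following lines may be convenient. This is a sketch under a strong assumption: that every `.py` file directly under `paper/figures/` and `paper/analysis/` is a standalone entry point (if those directories also contain shared helper modules, they would need to be filtered out). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def regeneration_commands(repo_root="."):
    """Build one `uv run python <script>` command per script found under
    paper/figures/ and paper/analysis/, figures first, sorted by name."""
    root = Path(repo_root)
    scripts = sorted(root.glob("paper/figures/*.py"))
    scripts += sorted(root.glob("paper/analysis/*.py"))
    return [
        ["uv", "run", "python", str(script.relative_to(root))]
        for script in scripts
    ]


def regenerate_all(repo_root="."):
    """Run every command from the repo root, stopping at the first failure."""
    for command in regeneration_commands(repo_root):
        subprocess.run(command, cwd=repo_root, check=True)
```

Calling `regenerate_all()` from the repo root after `uv sync` would run each script in turn; `check=True` makes a failing script raise instead of being silently skipped.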