This file is the entry point for a Claude Code (or any agent) session on this repository. It captures the project thesis, the final state, and the working conventions used throughout. It is intentionally internal-voice; a human reader should start with
README.mdinstead.
-
Paper shipped. All four stages complete. Manuscript at
paper/main.pdf(andpaper/main.tex). The shipping branch isphase-4-writeup. -
Headline result. Critical engagement substantially exceeds best-of-$N$
sampling, and decomposes into two structurally distinct modes:
evaluative critique drives epistemic calibration (
$d = 1.51$ ), comprehensive critique drives analytical novelty ($d = 2.54$ ). Replicates across focal models (DeepSeek V4 Pro → Grok 4.3). See paper §5. - Production stack. Focal: DeepSeek V4 Pro (primary), Grok 4.3 (replication). User-LLM: Claude Sonnet 4.6. Judge: GPT-5.4. Four independent providers; intentional cross-laboratory triangulation.
LLM default outputs represent a biased sample from a richer internal quality distribution. Critical engagement — sustained dialogic pushback, reframing, and error identification across conversation turns — shifts generation toward higher-quality regions. This study measures how much latent quality headroom exists and what behavioural patterns unlock it.
| Stage | What it was |
|---|---|
| 0 | Art study + 14-move taxonomy (4 humans-engaging-Opus conversations, derivation and freeze of the move set). |
| 1 | Experiment infrastructure (system-prompt scaffolding, conversation runner, judge pipeline). |
| 2 | Tier 1 experiment (14 open-ended + 2 objective tasks, 9 engagement conditions, primary study on DeepSeek V4 Pro). |
| 3 | Tier 2 + replication (Grok 4.3 replication on 6-task subset, non-think bookend, persistence-directive ablation). |
| 4 | Writeup (paper + appendices A–G; this is the shipping stage). |
Each stage was broken into phases; phase plans live in
docs/phases/phase-X.Y-plan.md and retros in
docs/phases/phase-X.Y-retro.md. The retros are the authoritative
project journal.
- Plan mode for any non-trivial task. Enter plan mode for any task with 3+ steps or architectural decisions. Write detailed specs upfront. If things go sideways, STOP and re-plan immediately.
- TDD. Define "done" before doing work. For code: write failing tests first. For research: write the falsifiable hypothesis and evaluation criteria first.
- Only plan the current phase in detail. Future phases stay at headline level. Anything else is waterfall in disguise.
- Verification before done. Never mark a task complete without proving it works. Ask: "Would a staff engineer approve this?"
- Objective before subjective. Run automated/quantitative checks before qualitative review.
- Separation of concerns. Docs drive design decisions; code is a tool.
- Subagent strategy. Use subagents liberally to keep main context window clean. One task per subagent.
- Autonomous bug fixing. When given a bug report, just fix it. Zero context switching from the user.
- Simplicity first. Minimal code, minimal impact. No over-engineering.
- No laziness. Root causes only. No temporary fixes. Senior developer standards.
- Minimal impact. Touch only what is necessary.
- Reproducibility. Pin all parameters, seeds, model versions. Raw data is never modified.
- Demand elegance (balanced). For non-trivial changes: pause and ask "is there a more elegant way?" Skip for simple fixes.
- Pre-register hypotheses. Write down what you expect and why before running experiments.
- Report nulls honestly. Negative results are results. The persistence-directive ablation in Appendix B is a worked example.
- Characterise distributions, not just means. Point estimates without uncertainty are insufficient.
- Dual validation where feasible. Two independent measurement / evaluation methods (the rubric judge and pairwise judge are this).
- Numbers must regenerate from committed scripts. Every reported
number maps to a script in
paper/figures/orpaper/analysis/that reads fromdata/experiments/. No one-time scripts.
Natural next directions, ranked roughly by leverage:
- Human-as-user validation at scale. The paper's lower-bound framing rests on a 4-task pilot. A larger human-engagement study on overlap tasks would tighten the LLM-as-user → human bridge and resolve whether the calibration gap holds outside the heavily trained focal model.
- Multi-user-LLM ablation. All reported results use Claude Sonnet 4.6 as the critic. A second user-LLM (e.g., Grok 4.3, Gemini 3 Pro) on the same tasks would bound the user-model-specific contribution to the effects.
- New task domains. Open-ended analytical reasoning is the tested regime. Adjacent regimes worth probing: design synthesis (more constrained), mathematical proof critique (verifiable on correctness but multi-dimensional on elegance), policy analysis (high-stakes, contested).
- Mechanism work. The "attentional redirection" mechanism is proposed as the interpretation consistent with evidence but is not mechanistically tested. Activation patching / steering work on a smaller open-weights focal model could move this from interpretation to causal claim.
- North star:
docs/conceptual.md - Research design:
research_design.md - Final paper:
paper/main.pdf - Phase retros:
docs/phases/ - Measurement framework:
docs/measurement_framework.md - Literature review (Block A only; B/C de-scoped):
literature-review/