Critical Thinking & LLM Output Quality

This file is the entry point for a Claude Code (or any agent) session on this repository. It captures the project thesis, the final state, and the working conventions used throughout. It is intentionally internal-voice; a human reader should start with README.md instead.

Final State

  • Paper shipped. All four stages complete. Manuscript at paper/main.pdf (and paper/main.tex). The shipping branch is phase-4-writeup.
  • Headline result. Critical engagement substantially exceeds best-of-$N$ sampling, and decomposes into two structurally distinct modes: evaluative critique drives epistemic calibration ($d = 1.51$), comprehensive critique drives analytical novelty ($d = 2.54$). Replicates across focal models (DeepSeek V4 Pro → Grok 4.3). See paper §5.
  • Production stack. Focal: DeepSeek V4 Pro (primary), Grok 4.3 (replication). User-LLM: Claude Sonnet 4.6. Judge: GPT-5.4. Four independent providers; intentional cross-laboratory triangulation.
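The $d$ values in the headline result are standardized effect sizes. As a reminder of what such a number means, here is a minimal sketch of Cohen's $d$ with a pooled standard deviation. This is an assumption about the estimator: the paper may use a different variant (for example Hedges' $g$ or a paired-sample $d$). The function name `cohens_d` and the scores are hypothetical and do not come from `data/experiments/`.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d using the pooled sample standard deviation.

    Illustrative only: the paper's exact estimator is not stated in this file.
    """
    n1, n2 = len(treatment), len(control)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Made-up judge scores, chosen so the arithmetic is easy to check by hand:
# means 4 and 3, both sample variances 4, pooled SD 2, so d = (4 - 3) / 2.
engaged_scores = [2, 4, 6]
baseline_scores = [1, 3, 5]
print(cohens_d(engaged_scores, baseline_scores))
```

Read this way, $d = 1.51$ says the two condition means sit about one and a half pooled standard deviations apart.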

Project Thesis

An LLM's default output is a biased sample from a richer internal quality distribution. Critical engagement (sustained dialogic pushback, reframing, and error identification across conversation turns) shifts generation toward higher-quality regions of that distribution. This study measures how much latent quality headroom exists and which behavioural patterns unlock it.

Structure (as it shipped)

| Stage | What it was |
|-------|-------------|
| 0 | Art study + 14-move taxonomy (4 humans-engaging-Opus conversations, derivation and freeze of the move set). |
| 1 | Experiment infrastructure (system-prompt scaffolding, conversation runner, judge pipeline). |
| 2 | Tier 1 experiment (14 open-ended + 2 objective tasks, 9 engagement conditions, primary study on DeepSeek V4 Pro). |
| 3 | Tier 2 + replication (Grok 4.3 replication on 6-task subset, non-think bookend, persistence-directive ablation). |
| 4 | Writeup (paper + appendices A–G; this is the shipping stage). |

Each stage was broken into phases; phase plans live in docs/phases/phase-X.Y-plan.md and retros in docs/phases/phase-X.Y-retro.md. The retros are the authoritative project journal.

Ground Rules (durable; useful for any continuation)

Workflow

  1. Plan mode for any non-trivial task. Enter plan mode for any task with 3+ steps or architectural decisions. Write detailed specs upfront. If things go sideways, STOP and re-plan immediately.
  2. TDD. Define "done" before doing work. For code: write failing tests first. For research: write the falsifiable hypothesis and evaluation criteria first.
  3. Only plan the current phase in detail. Future phases stay at headline level. Anything else is waterfall in disguise.
  4. Verification before done. Never mark a task complete without proving it works. Ask: "Would a staff engineer approve this?"
  5. Objective before subjective. Run automated/quantitative checks before qualitative review.
  6. Separation of concerns. Docs drive design decisions; code is a tool.
  7. Subagent strategy. Use subagents liberally to keep main context window clean. One task per subagent.
  8. Autonomous bug fixing. When given a bug report, just fix it. Zero context switching from the user.

Code

  1. Simplicity first. Write the least code that solves the problem. No over-engineering.
  2. No laziness. Root causes only. No temporary fixes. Senior developer standards.
  3. Minimal impact. Touch only what is necessary.
  4. Reproducibility. Pin all parameters, seeds, model versions. Raw data is never modified.
  5. Demand elegance (balanced). For non-trivial changes: pause and ask "is there a more elegant way?" Skip for simple fixes.
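One way to honour the reproducibility rule is to keep every run parameter in a single immutable object and derive a stable fingerprint from it. The sketch below is only an illustration of that idea. `RunConfig`, its fields, the model identifier strings, and the seed and temperature values are all hypothetical and are not taken from this repository's runner.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RunConfig:
    """Hypothetical container for everything that must be pinned per run."""

    focal_model: str
    user_model: str
    judge_model: str
    seed: int
    temperature: float

    def fingerprint(self) -> str:
        # Sorted keys give a canonical JSON string, so equal configs
        # always hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Placeholder identifiers and values; real API model strings may differ.
config = RunConfig(
    focal_model="deepseek-v4-pro",
    user_model="claude-sonnet-4.6",
    judge_model="gpt-5.4",
    seed=1234,
    temperature=0.7,
)
rerun = replace(config, seed=1235)  # a new object; the original is untouched
print(config.fingerprint(), rerun.fingerprint())
```

Storing the fingerprint next to each raw output file makes it cheap to check later that a reported number came from the configuration it claims.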

Experimental Discipline

  1. Pre-register hypotheses. Write down what you expect and why before running experiments.
  2. Report nulls honestly. Negative results are results. The persistence-directive ablation in Appendix B is a worked example.
  3. Characterise distributions, not just means. Point estimates without uncertainty are insufficient.
  4. Dual validation where feasible. Two independent measurement / evaluation methods (the rubric judge and pairwise judge are this).
  5. Numbers must regenerate from committed scripts. Every reported number maps to a script in paper/figures/ or paper/analysis/ that reads from data/experiments/. No one-time scripts.
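Rules 3 and 5 can be illustrated together: report an interval alongside the point estimate, and fix the seed so the interval regenerates exactly. The following is a minimal percentile-bootstrap sketch using only the standard library. It is not the analysis code in `paper/analysis/`; the function name `bootstrap_ci`, its defaults, and the scores are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(treatment, control, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for a difference in means.

    Returns (point_estimate, lower, upper). A fixed seed makes the
    interval reproducible across reruns.
    """
    rng = random.Random(seed)  # local generator; global state is untouched
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    point = mean(treatment) - mean(control)
    return point, diffs[lower_idx], diffs[upper_idx]


# Made-up scores for illustration only.
point, lower, upper = bootstrap_ci([5, 6, 7, 8], [1, 2, 3, 4])
print(f"difference = {point:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

With samples this small the interval is wide, which is the point of the rule: the spread is part of the result.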

If Continuing

Natural next directions, ranked roughly by leverage:

  1. Human-as-user validation at scale. The paper's lower-bound framing rests on a 4-task pilot. A larger human-engagement study on overlap tasks would tighten the LLM-as-user → human bridge and resolve whether the calibration gap holds outside the heavily trained focal model.
  2. Multi-user-LLM ablation. All reported results use Claude Sonnet 4.6 as the critic. A second user-LLM (e.g., Grok 4.3, Gemini 3 Pro) on the same tasks would bound the user-model-specific contribution to the effects.
  3. New task domains. Open-ended analytical reasoning is the tested regime. Adjacent regimes worth probing: design synthesis (more constrained), mathematical proof critique (verifiable on correctness but multi-dimensional on elegance), policy analysis (high-stakes, contested).
  4. Mechanism work. The "attentional redirection" mechanism is proposed as the interpretation consistent with evidence but is not mechanistically tested. Activation patching / steering work on a smaller open-weights focal model could move this from interpretation to causal claim.

Key References (in-repo)

  • North star: docs/conceptual.md
  • Research design: research_design.md
  • Final paper: paper/main.pdf
  • Phase retros: docs/phases/
  • Measurement framework: docs/measurement_framework.md
  • Literature review (Block A only; B/C de-scoped): literature-review/