
Critical Thinking & LLM Output Quality: Research Design

1. Core Thesis & Framing

The Observation

Even with today's most powerful models, conversations on complex or open-ended topics get steered in very different directions depending on the user's critical thinking: their ability to identify weak points, push back on glossed-over reasoning, reframe the discourse, and synthesize arguments carefully. Discerning users have noticed this; alignment and sycophancy researchers know narrow versions of it. But the broader claim hasn't been tested rigorously.

Latent Quality Headroom

Core question: Is there a measurable quality ceiling for AI-assisted cognition that is only reached through active intellectual engagement?

LLM default outputs are a biased sample from a richer internal quality distribution. Training incentives (RLHF, helpfulness optimization) push generation toward plausible and agreeable responses rather than maximally rigorous ones. Critical engagement — sustained dialogic pushback, reframing, error identification, and precision demands across conversation turns — shifts generation toward higher-quality regions.

Mechanism: Attentional redirection. Critical engagement functions as attentional redirection within the model's existing knowledge space. The model possesses the knowledge needed for higher-quality responses — counterexamples, caveats, alternative framings — but default generation, optimized for plausibility and agreeableness, doesn't activate this knowledge. Critical moves redirect attention: evaluative moves point to known errors, elicitation forces retrieval of latent justifications, generative moves point to relevant knowledge the model didn't connect to the current context, calibration forces self-assessment against standards the model knows but doesn't apply. The user doesn't inject new content; they redirect which existing knowledge gets activated.

This study measures how much latent quality headroom exists and what behavioral patterns unlock it.

Key Distinction: This Is Not Prompt Engineering

Prompt engineering is about the initial query. This claim is about sustained dialogic critical engagement across turns — a fundamentally different skill. It's about what happens in turns 3-15, not turn 1.

Three Versions of the Claim

| Version | Comparison | Testability | Interest |
|---|---|---|---|
| A (superlinear) | LLM + critical thinker > critical thinker alone | Hard but provocative | High |
| B (behavioral) | LLM + critical thinker > LLM + uncritical thinker | Most testable | Medium-High |
| C (tautological) | Better thinkers get better results | Almost tautological | Low |

Ideal 2×2 design tests both A and B simultaneously:

| | No LLM | With LLM |
|---|---|---|
| Critical engagement | Baseline (solo expert) | The interesting cell |
| Passive engagement | Weaker baseline | "Vibes-based" LLM use |

Book-Ending Strategy

This study establishes the extremes (passive floor vs. critical ceiling). If the gap is large enough, follow-up studies explore the realistic middle ground (selective engagement, minimum effective dose, where-in-conversation-to-engage).

LLM-LLM Simulation as Lower Bound

No human participants are available, so all experiments use LLM-LLM interactions: a focal LLM (the one being evaluated) and a user LLM (simulating human behavior via system prompts).

What we're actually testing: Does the focal LLM have latent quality that is only unlocked by critical engagement patterns? This is a claim about the model, not the user.

The "necessary condition" argument: If the model doesn't respond differently to critical engagement, the human claim is dead. If it does, the human claim is plausible and worth testing with real participants. We establish the mechanism; the human-level claim remains a separate hypothesis for follow-up work.

Qualified lower bound: The simulation is a lower bound for generative moves — human domain experts are better "pointers" because they know which counterexamples are maximally incisive and which reframings cut to the heart of the issue. For evaluative and elicitation moves, the LLM user may be near-optimal (consistent, tireless, comprehensive), meaning the simulation may be near-ceiling for those move types. A third dimension — strategic deployment (knowing when to push and when to accept) — is a genuinely human-advantage capability that the system-prompt-driven user LLM captures poorly. Overall, if the simulation shows a gap, the human effect is plausibly larger for L3 but may not be much larger for L1/L2. Either outcome is informative.


2. Why This Is Hard to Test

The Fundamental Identification Problem

The core problem is isolating the causal effect of critical thinking behavior in the conversation from critical thinking ability that would have produced a good answer anyway. If a senior researcher pushes back on a model's flawed climate economics reasoning, are they getting a better answer because of the pushback, or because they already knew enough to write the answer themselves?

Key Confounds

| Confound | Threat | Control strategy |
|---|---|---|
| Domain expertise | Experts push back more AND produce better work independently | Within-subjects design; test on unfamiliar domains |
| Prompt quality | Better thinkers write better initial prompts | Standardize first prompt, only vary follow-up behavior |
| Task difficulty | Easy tasks don't need pushback | Calibrated tasks with known difficulty |
| Model stochasticity | Same conversation could yield different outputs | Multiple runs; low temperature settings |
| Time/effort | Critical thinkers spend more turns: is it just effort? | Control for turn count; include "effortful but uncritical" condition |

The Time Confound (Unpacked)

If someone who pushes back spends 8 turns and gets excellent output, while a passive user spends 2 turns and gets mediocre output — is the difference about critical thinking or just effort? This is a real threat because more turns = more model compute = more chances to produce good content.

Solution: The effortful-uncritical condition — same number of turns, same time spent, but engagement is re-prompting/requesting alternatives without substantive challenge. If the critical condition still wins, it's the type of engagement, not the amount.


3. Behavioral Taxonomy of Critical Thinking Moves

Status: Candidate taxonomy. To be validated, refined, and potentially expanded by the art study (Phase 0).

Organized by cognitive function:

| Category | Move | Description | Example |
|---|---|---|---|
| Evaluative | Error flagging | Correct factual mistakes with explanation | "That's not right. X works differently because..." |
| Evaluative | Logical gap identification | Challenge unjustified inferential steps | "Steps 2→3 don't follow. What justifies that?" |
| Evaluative | Assumption surfacing | Expose hidden premises, propose alternatives | "You're assuming Y, but what if Z?" |
| Generative | Reframing | Reconceptualize the problem | "The real question isn't X, it's Y" |
| Generative | Counterexample provision | Provide a specific case that breaks the model's claim | "That principle breaks down in the case of X. How do you account for that?" |
| Generative | Steelman + counter | Acknowledge, then present stronger objection | "I see the case for X, but the stronger objection is..." |
| Generative | Constraint introduction | Add a new consideration the model didn't account for | "But what about the case where X applies?" |
| Elicitation | Evidence demand | Request reasoning for claims | "What's the basis for that? Reason through it." |
| Elicitation | Precision demand | Force specificity on vague claims | "What do you mean by 'significant'? Quantify." |
| Elicitation | Synthesis challenge | Test coherence across turns | "How does this square with what you said about X?" |
| Calibration | Meta-cognitive probe | Ask model to evaluate its own reasoning | "What's the weakest part of your argument?" |
| Calibration | Scope correction | Flag overweighted/underweighted factors | "You're overweighting A and ignoring B entirely" |
| Calibration | Register/output rejection | Reject overall quality/depth/originality without citing specific error | "Try again, this time resorting less to cached patterns of thinking." |
| Calibration | Sycophancy detection | Notice and call out when the model agrees too readily or defaults to agreeableness over honesty | "You're agreeing with me to please me, not because you think I'm right" |

The taxonomy itself is a contribution — a codebook of critical thinking behaviors in LLM interaction (14 moves across 4 cognitive functions).

Scope note: This taxonomy is derived from one researcher's engagement patterns across conversations with Claude and ChatGPT models. Different expert users with different domain backgrounds may exhibit critical thinking moves not captured here. The taxonomy is a candidate codebook, not a universal classification. Generalizability to other users is an explicit limitation.

Graduated Levels

Three levels, run from start (not deferred):

  • Level 1 (evaluative only): Error flagging + logical gap identification + assumption surfacing
  • Level 2 (+ elicitation): L1 + evidence demand, precision demand, synthesis challenge
  • Level 3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection

The ordering hypothesis: evaluative moves are the "minimum" critical thinking (spot errors), elicitation gets more out of the model, generative and calibration contribute new intellectual substance and force self-assessment. This is a testable ordering. If L1 ≈ L2 ≈ L3, any critical engagement works regardless of sophistication.
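The cumulative structure of the three levels can be sketched as a small lookup. This is an illustration only: the move names come from the taxonomy table, while the dictionary layout and the `moves_for_level` helper are assumptions, not part of the study's tooling.

```python
# Move names follow the taxonomy table; the data layout is an illustrative assumption.
MOVES_BY_CATEGORY = {
    "evaluative": ["error flagging", "logical gap identification", "assumption surfacing"],
    "elicitation": ["evidence demand", "precision demand", "synthesis challenge"],
    "generative": ["reframing", "counterexample provision", "steelman + counter",
                   "constraint introduction"],
    "calibration": ["meta-cognitive probe", "scope correction",
                    "register/output rejection", "sycophancy detection"],
}

# Each level adds whole categories on top of the previous level.
LEVEL_CATEGORIES = {
    1: ["evaluative"],
    2: ["evaluative", "elicitation"],
    3: ["evaluative", "elicitation", "generative", "calibration"],
}


def moves_for_level(level: int) -> list[str]:
    """Return the moves the user LLM may deploy at the given level."""
    return [move for cat in LEVEL_CATEGORIES[level] for move in MOVES_BY_CATEGORY[cat]]
```

Under this layout L1 carries 3 moves, L2 carries 6, and L3 carries all 14, so each level is a strict superset of the one below it.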

Open Questions

  • Should "partial agreement + redirection" be its own move?

4. Experimental Design

4.1 Conditions (9 total)

| # | Condition | User LLM behavior | What it tests | Focal temp | User temp |
|---|---|---|---|---|---|
| 1 | Passive | Accept responses. Generic follow-ups ("tell me more", "expand") | The floor: default multi-turn quality | default | 0.7 |
| 2 | Effortful-uncritical | Request alternatives, rephrasings, detail; never challenge substance | Is it just more turns/compute? | default | 0.7 |
| 3 | Self-critique | Focal LLM critiques itself via system prompt; paired with passive user | Can the model self-unlock? | default | 0.7 |
| 4 | Generic external critique | External LLM says "find problems" without specific moves | Does any external signal help? | default | 0.7 |
| 5 | Critical L1 (evaluative) | Error flagging + logical gap identification + assumption surfacing | Does spotting errors alone help? | default | 0.7 |
| 6 | Critical L2 (+ elicitation) | L1 + evidence demand, precision demand, synthesis challenge | Does extracting more from the model add value? | default | 0.7 |
| 7 | Critical L3 (full taxonomy) | All moves including generative + calibration (incl. meta-cognitive probe) | Does contributing new substance and forcing self-assessment matter? | default | 0.7 |
| 8 | Adversarial | Push back on everything, including correct claims | Is indiscriminate pushback as good as targeted? | default | 0.7 |
| 9 | Best-of-N passive | Generate 10 independent single-turn responses, select best via judge | Is it just sampling diversity? | default | N/A |

Temperature design: The focal LLM runs at API default temperature (~1.0) for ecological validity, since this is how real users experience the model. The user LLM runs at 0.7 to reliably follow its system prompt (delivering the intended engagement style) while providing natural variation across runs. At temp=0 for both models, repeated runs would be near-deterministic and produce almost identical conversations, making replication uninformative.

The graduated levels (L1 → L2 → L3) test dose-response directly: does the range of critical moves matter, or is any critical engagement sufficient? The adversarial condition distinguishes targeted critical thinking from indiscriminate disagreement. The self-critique condition tests generation-verification asymmetry. The best-of-N condition controls for the possibility that critical engagement merely increases sampling diversity.
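The best-of-N control can be sketched in a few lines. The `generate` and `judge` callables are hypothetical stand-ins for a single-turn focal LLM call and a rubric-scoring call; neither is a real API, and the actual harness may select differently.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: one independent single-turn focal LLM call
    judge: Callable[[str], float],   # hypothetical: rubric score for a single response
    n: int = 10,
) -> str:
    """Sample n independent responses and return the one the judge scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=judge)
```

Because every candidate is a fresh single-turn sample, any gain here comes from sampling diversity alone, which is the alternative explanation this condition is meant to rule out.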

Predicted Hierarchy

Passive ≈ Effortful-uncritical ≈ Best-of-N-passive
     <
Self-critique (modest improvement)
     <
Generic external critique
     ≤
Critical L1 (evaluative)
     <
Critical L2 (+ elicitation)
     ≤
Critical L3 (full taxonomy)
     >
Adversarial (indiscriminate pushback hurts)

Key Comparisons

| Comparison | If true, it shows... |
|---|---|
| L3 > Passive | Critical engagement helps (the basic claim) |
| L3 > Effortful-uncritical | It's the TYPE of engagement, not the AMOUNT |
| L3 > Best-of-N-passive | It's directional context, not sampling diversity |
| L3 > Generic critique | The specific behavioral moves matter, not just any pushback |
| L3 > Self-critique | External verification is irreplaceable |
| L3 > Adversarial | Targeted pushback > indiscriminate disagreement |
| L2 > L1 | Elicitation adds value beyond error-spotting |
| L3 > L2 | Generative contributions add value beyond elicitation |
| L1 ≈ Generic | Evaluative moves ARE generic critique (no taxonomy needed for L1) |
| Generic > Self-critique | External input adds value beyond self-reflection |
| Effortful ≈ Passive | Effort alone doesn't help |

4.2 Task Bank

Split: ~7-8 objective tasks + ~18-20 open-ended tasks (27 total as currently listed: 8 objective, 19 open-ended).

The hypothesis predicts the effect is larger for open-ended tasks. If critical engagement shows NO effect on objective tasks but a LARGE effect on open-ended tasks, that's an interesting finding: critical thinking matters precisely where there's no single right answer.

Open-Ended Tasks (~18-20)

Category A — Analysis & argument (6-7 tasks)

  • A1: Technology strategy under uncertainty — a mid-size semiconductor company with ReRAM IP analyzing strategic options (double down, pivot to CXL, license IP)
  • A2: The replication crisis — structural problem or self-correcting science? Construct and steelman both positions
  • A3: Should AI labs publish frontier model weights? Policy analysis considering security, progress, competition, democratic access
  • A4: Simulation vs. experiment — when should regulators accept computational simulation evidence?
  • A5: Is systems thinking a genuine intellectual framework or an aesthetic? When does it add explanatory power vs. re-describing?
  • A6: The automation paradox in knowledge work — as AI automates more, does remaining work become more or less valuable?
  • A7: Scaling laws as epistemology — empirical regularities vs. fundamental principles? How should they inform investment?

Category B — Design & synthesis (5-6 tasks)

  • B1: Design an evaluation framework for LLM-assisted decision-making in high-stakes domains
  • B2: Design a mechanism for funding public goods research (lottery, retroactive, quadratic, hybrid)
  • B3: Design a curriculum for teaching critical thinking in the age of AI
  • B4: Design an architecture for trustworthy multi-agent AI systems with consensus requirements
  • B5: Design a framework for measuring 'AI readiness' of organizations beyond checklists

Category C — Causal/mechanistic reasoning (4-5 tasks)

  • C1: Why do large organizations systematically underinvest in maintenance?
  • C2: Why does interdisciplinary research underperform expectations?
  • C3: The semiconductor industry's consolidation — inevitable or contingent?
  • C4: Why do prediction markets underperform their theoretical promise?

Category D — Cross-domain synthesis (3-4 tasks)

  • D1: What can semiconductor fabrication teach us about AI safety?
  • D2: Physics intuitions that transfer (or don't) to ML
  • D3: Behavioral economics of API design

Objective Tasks (~7-8)

  • O1-O3: Multi-constraint design problems (conference scheduling, power delivery network, research budget allocation)
  • O4-O5: Multi-source synthesis with contradictions (sleep/cognition abstracts, NAND endurance reports)
  • O6-O8: Specification-sensitive tasks (ambiguous coding spec, statistical claim evaluation, Fermi estimation)

[OPEN]: Objective tasks may need further redesign. Frontier models are strong on standard objective benchmarks. Current proposals attempt to find tasks where models reliably produce flawed-but-recoverable first responses, but this remains unvalidated.

Task Tiering (Staged Spending)

  • Tier 1 (highest expected effect, test first): A5, A6, A2, A7, C2, C1, D1, B2, O4, O6, O7, O5
  • Tier 2 (test after Tier 1 promising): A1, A3, A4, B1, B3, B4, B5, C3, C4, D2, D3, O1, O2, O3, O8

[PENDING]: Tiering subject to revision after ChatGPT conversation analysis.

4.3 Taxonomy Injection Strategy

Decision: Graduated from start (3 levels).

  • L1 (evaluative): Error flagging, logical gap identification, assumption surfacing
  • L2 (evaluative + elicitation): L1 + evidence demand, precision demand, synthesis challenge
  • L3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection

This replaces a single "taxonomy-informed critique" condition with three graduated conditions (L1/L2/L3), yielding 9 total conditions. Tests dose-response directly rather than relying on post-hoc annotation.

Post-hoc annotation of which specific moves were deployed within each level remains valuable as exploratory analysis (e.g., which L2 moves drive the most quality lift?).


5. Art Study (Phase 0)

Purpose

  1. Discover and validate the behavioral taxonomy of critical thinking moves in LLM conversations
  2. Extract real conversational patterns to make user-LLM system prompts empirically grounded
  3. Understand how moves combine, sequence, and interact in natural conversation

Design

4 topics × 4 engagement styles = 16 conversations

Topics (4 to be selected from 8 candidates)

Must span 3 expertise zones:

Zone 1 — Deep expertise (select 2)

  • 1a: NAND scaling limits and the future of storage architecture — physics walls vs. engineering challenges, 500+ layer viability, alternative memory technologies, marketing vs. physics
  • 1b: When does simulation fail? The epistemology of computational models — discretization error, model error, parametric uncertainty, V&V vs. UQ, philosophy of models
  • 1c: Scaling laws, emergent capabilities, and what they actually predict — power laws vs. capabilities, Schaeffer critique, extrapolation priors, phase transition analogy

Zone 2 — Adjacent expertise (select 1)

  • 2a: The hard problem of consciousness and its implications for AI — explanatory gap vs. confusion, functionalism, substrate-independence, consciousness audits
  • 2b: Nudge theory under scrutiny — replication failures, nudge vs. manipulation, structural reform, when behavioral science helps policy
  • 2c: Emergence in complex systems — explanatory power vs. intellectual placeholder, weak vs. strong emergence, systems thinking's falsifiability

Zone 3 — Stretch domains (select 1)

  • 3a: Industrial policy — CHIPS Act, empirical track record, why economists disagree, semiconductor case as special
  • 3b: Antibiotic resistance — market failure, commons tragedy, innovation problem, agricultural antibiotics

Selected: 1b (Simulation epistemology), 1c (Scaling laws), 2a (Hard problem of consciousness), 3a (Industrial policy / CHIPS Act). Reserve topics available if needed.

Engagement Styles (per topic)

| Style | What the user does | Annotation focus |
|---|---|---|
| Passive | Accept responses, generic follow-ups | Baseline quality; model's default register |
| Effortful-uncritical | Same turn count as critical; request alternatives, detail, examples; never challenge substance | Controls for effort |
| Critical | Full taxonomy of moves | The treatment; which moves drive quality shifts? |
| Passive → critical | Passive for first 5 turns, then shift to critical | Tests whether late engagement can recover from passive start |

Annotation Scheme

Approach: Hybrid — solo open coding first, then independent LLM second pass, then compare.

Per user turn:

  1. Move type(s) — a single turn can contain multiple moves
  2. Intensity — light touch vs. forceful challenge (1-3 scale)
  3. Domain-dependence — did this move require domain knowledge, or was it domain-general?

Per model turn:

  1. Quality — subjective 1-5 overall quality rating
  2. Quality shift — improve / flat / degrade vs. previous model turn (+1/0/-1)
  3. Behavioral markers (check all that apply):
    • Hedging shift, precision shift, self-correction, depth shift, frame adoption, defensive retreat, sycophantic agreement
  4. Latent quality surfaced? — does this response contain substance the model "knew" but wouldn't have produced without critical engagement? (yes/no/unclear)
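The two annotation schemes above can be sketched as record types. The class and field names are hypothetical, chosen only to mirror the listed items; the validation ranges are taken from the scheme itself.

```python
from dataclasses import dataclass, field


@dataclass
class UserTurnAnnotation:
    moves: list[str]              # a single turn can contain multiple moves
    intensity: int                # 1 (light touch) to 3 (forceful challenge)
    needs_domain_knowledge: bool  # False means the move was domain-general

    def __post_init__(self) -> None:
        if not 1 <= self.intensity <= 3:
            raise ValueError("intensity must be 1-3")


@dataclass
class ModelTurnAnnotation:
    quality: int                                    # subjective 1-5 rating
    quality_shift: int                              # +1 improve, 0 flat, -1 degrade
    markers: set[str] = field(default_factory=set)  # e.g. {"self-correction"}
    latent_quality: str = "unclear"                 # "yes", "no", or "unclear"

    def __post_init__(self) -> None:
        if not 1 <= self.quality <= 5:
            raise ValueError("quality must be 1-5")
        if self.quality_shift not in (-1, 0, 1):
            raise ValueError("quality_shift must be -1, 0, or +1")
        if self.latent_quality not in ("yes", "no", "unclear"):
            raise ValueError("latent_quality must be yes, no, or unclear")
```

Validating at construction time keeps out-of-range codes from entering the dataset before the human and LLM coding passes are compared.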

Saturation Criterion

After each conversation, check: did any new move types emerge? Stop when 2 consecutive conversations yield no new codes. Minimum 5 conversations regardless.
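The stopping rule can be written as a small check. This is a sketch: the function name and the input format (number of new move types found in each conversation, in coding order) are assumptions.

```python
def saturation_reached(new_codes_per_conversation: list[int],
                       window: int = 2, minimum: int = 5) -> bool:
    """True once at least `minimum` conversations are coded and the last
    `window` of them produced no new move types."""
    if len(new_codes_per_conversation) < minimum:
        return False
    return all(count == 0 for count in new_codes_per_conversation[-window:])
```

For example, `saturation_reached([3, 2, 1, 0, 0])` is true, while `saturation_reached([3, 0, 0])` is false because the five-conversation minimum has not been met.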

Parameters

  • Length: 12-18 turns per conversation (target 15)
  • Platform: Claude.ai
  • Model: Opus (richest responses and subtlest errors for art study purposes)
  • Max 2 conversations per session to avoid fatigue/pattern-lock

Taxonomy Freeze

Status: FROZEN (2026-05-03)

The behavioral taxonomy is finalized at 14 moves across 4 categories. Validated via art study: 12 conversations (4 passive, 4 effortful-uncritical, 4 critical) across 4 topics spanning 3 expertise zones. Saturation confirmed — no new move types emerged in the final 2 critical conversations.

Annotation findings incorporated:

  • Assumption surfacing is often co-deployed with other moves (especially precision demand and error flagging) rather than appearing standalone. Kept as distinct move — it represents a distinct cognitive operation regardless of delivery vehicle.
  • Self-referential consistency check (applying model's own stated criteria against its claims) is a high-value variant of synthesis challenge, not a separate move.
  • Domain-expertise correction, deflation resistance, and sycophancy-adjacent pattern detection are subtypes of existing moves (error flagging, scope correction, sycophancy detection respectively).

Phase 0 is exploratory; Phase 1+ is confirmatory. Any taxonomy changes discovered during the experiment are documented but do not alter the experimental conditions mid-run.

Output

  1. Finalized behavioral taxonomy (with real examples)
  2. Move frequency distribution
  3. Move co-occurrence patterns
  4. Domain-dependence classification
  5. Principled groupings → graduated prompt levels
  6. Template system prompts for each condition

6. Evaluation

Open-Ended Tasks: LLM-as-Judge

Judge model: Different from both the focal LLM and the user LLM (prevents model-specific biases).

Rubric dimensions (scored 1-5 each):

| Dimension | Weight | What it measures | Anchors |
|---|---|---|---|
| Factual accuracy & calibration | ×1.0 | Are claims well-grounded and is confidence proportionate? | 1=major unsupported claims, 3=mostly supported, 5=well-grounded with calibrated uncertainty |
| Logical coherence | ×1.0 | Does the argument follow? | 1=non-sequiturs, 3=mostly coherent, 5=airtight |
| Depth/nuance | ×1.5 | Does it go beyond surface-level? | 1=superficial, 3=competent, 5=expert-level |
| Completeness | ×1.0 | Are important considerations covered? | 1=major omissions, 3=covers main points, 5=comprehensive |
| Intellectual honesty | ×1.5 | Acknowledges uncertainty, limitations, counterarguments? | 1=overconfident, 3=some hedging, 5=calibrated |
| Originality | ×0.75 | Non-obvious insights or connections? | 1=generic/boilerplate, 3=competent, 5=genuinely insightful |

Sensitivity analysis: show results hold across equal-weight and alternative weighting schemes. Pre-registered alternative: originality at ×1.5 (art study showed novel reframing is one of critical engagement's most distinctive contributions; if the effect is stronger under this weighting, it indicates where quality improvement concentrates).
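A minimal sketch of the composite and its sensitivity variants follows. The weights are the ones in the rubric; the dictionary keys and function name are illustrative, and normalizing by the weight sum (so the composite stays on the 1-5 scale) is an assumption about how "weighted average" is computed.

```python
WEIGHTS = {
    "factual_accuracy_calibration": 1.0,
    "logical_coherence": 1.0,
    "depth_nuance": 1.5,
    "completeness": 1.0,
    "intellectual_honesty": 1.5,
    "originality": 0.75,
}

# Alternative schemes for the sensitivity analysis.
EQUAL_WEIGHTS = {dim: 1.0 for dim in WEIGHTS}
ORIGINALITY_UP = {**WEIGHTS, "originality": 1.5}  # pre-registered alternative


def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the six rubric scores (each 1-5)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    return sum(scores[d] * weights[d] for d in weights) / sum(weights.values())
```

Running the same scores through `WEIGHTS`, `EQUAL_WEIGHTS`, and `ORIGINALITY_UP` shows how much a conclusion depends on the weighting choice.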

Standardized closing prompt: Every conversation across all 9 conditions ends with a standardized final prompt from the user LLM that produces the evaluable artifact. This prompt is identical across all conditions:

"Now, synthesizing everything from our discussion, provide your most complete and considered response to the original question: [original opening prompt repeated verbatim]."

This ensures: (a) the final response is comparable across conditions — it addresses the same question, (b) it captures whatever quality improvement the conversation produced — the model synthesizes everything it has learned/revised, (c) it is a self-contained artifact scorable without conversation history. Without this standardization, passive final turns ("anything else?") and critical final turns (response to a specific pushback) are incommensurable objects.
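One way to guarantee the closing turn is identical across conditions is to build it from a single template. The constant and function names here are hypothetical.

```python
CLOSING_TEMPLATE = (
    "Now, synthesizing everything from our discussion, provide your most "
    "complete and considered response to the original question: {opening_prompt}"
)


def closing_prompt(opening_prompt: str) -> str:
    """Build the final user turn; the opening prompt is repeated verbatim."""
    return CLOSING_TEMPLATE.format(opening_prompt=opening_prompt)
```

The same function would be called for every condition, so the only thing that varies between final turns is the conversation history that precedes them.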

Scoring procedure:

  1. Judge sees ONLY the standardized final response, not conversation history
  2. The focal LLM system prompt instructs the model to produce self-contained responses (not referencing conversation context like "as you mentioned earlier"), reinforcing comparability
  3. Judge scores against rubric without knowing which condition produced it
  4. Each output scored 3 times (stability check)
  5. Validate on 10% subset against personal ratings

Objective Tasks: Automated Scoring

| Task type | Scoring method |
|---|---|
| Constraint satisfaction | Binary per constraint + total count |
| Multi-source synthesis | Checklist: contradictions identified, methodological differences noted, conclusions scoped |
| Specification-sensitive | Checklist: ambiguities flagged, interpretations addressed |
| Fermi estimation | Within 1 order of magnitude + key factors identified |
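Two of these scoring rules are mechanical enough to sketch directly. The function names are illustrative, and the reference value for a Fermi task is assumed to be supplied by the task author.

```python
import math


def fermi_within_one_order(estimate: float, reference: float) -> bool:
    """True if the estimate is within a factor of 10 of the reference value."""
    if estimate <= 0 or reference <= 0:
        raise ValueError("Fermi quantities must be positive")
    return abs(math.log10(estimate / reference)) <= 1.0


def constraint_score(satisfied: list[bool]) -> tuple[list[int], int]:
    """Binary result per constraint plus the total number satisfied."""
    per_constraint = [int(ok) for ok in satisfied]
    return per_constraint, sum(per_constraint)
```

The "key factors identified" part of Fermi scoring and the two checklist task types still need a checklist per task and are not covered by this sketch.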

What Gets Evaluated

  • Final output: Last substantive model response
  • Trajectory: Quality at turn 1, 5, 10, and final turn (captures both "preventing degradation" and "unlocking quality")
  • Per-turn improvement: Average quality delta per turn (efficiency metric)
  • Degradation detection: Within passive condition, does turn-10 quality < turn-1 quality?
  • Latent quality detection: Does critical turn-10 exceed ALL conditions' turn-1 quality?

Trajectory analysis requires LLM-as-judge scoring at 4 checkpoints per conversation (~3× judge API cost).
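The trajectory metrics above reduce to simple comparisons once checkpoint scores exist. In this sketch, `checkpoints` maps a turn number to its judged composite score; the function names and that input format are assumptions.

```python
def per_turn_delta(checkpoints: dict[int, float]) -> float:
    """Average quality change per turn between the first and last scored turn."""
    turns = sorted(checkpoints)
    if len(turns) < 2:
        raise ValueError("need at least two scored turns")
    first, last = turns[0], turns[-1]
    return (checkpoints[last] - checkpoints[first]) / (last - first)


def degraded(checkpoints: dict[int, float], early: int = 1, late: int = 10) -> bool:
    """Degradation check: is quality at the late turn below the early turn?"""
    return checkpoints[late] < checkpoints[early]


def latent_quality_detected(critical_turn10: float, all_turn1_scores: list[float]) -> bool:
    """Does the critical condition at turn 10 exceed every condition's turn-1 score?"""
    return critical_turn10 > max(all_turn1_scores)
```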


7. Models

Main Experiment

| Role | Model | Provider | Pricing ($/M tokens) | Mode | Rationale |
|---|---|---|---|---|---|
| Focal LLM | DeepSeek V4 Pro | DeepSeek | $1.74 in / $3.48 out | Real-time | Frontier open-weights; both non-think and think-high bookends |
| User LLM | Claude Sonnet 4.6 | Anthropic | $3.00 in / $15.00 out | Real-time | Frontier; strong system prompt following; no daily rate limits |
| Judge LLM | GPT-5.4 | OpenAI | $2.50 in / $15.00 out | Real-time | Strong evaluator; temperature=0 for reproducibility |

Replication (deferred pending main effect)

| Role | Model | Provider | Rationale |
|---|---|---|---|
| Focal LLM | Gemini 3.1 Pro | Google | Different architecture from DeepSeek; cross-model generality |
| User LLM | Claude Sonnet 4.6 | Anthropic | Same user as main; isolates focal model as only variable |
| Judge LLM | GPT-5.4 | OpenAI | Same judge as main; scores are comparable across experiments |

Three independent providers per experiment (DeepSeek/Anthropic/OpenAI for main, Google/Anthropic/OpenAI for replication). User and judge held constant across main and replication — only the focal changes, giving the cleanest possible cross-model comparison.

Key design choices:

  • User LLM must be real-time (turn-by-turn conversation with focal). Batch not possible.
  • User LLM must be frontier-tier: the user-focal interaction is a sparring contest. A weak user produces weak critiques and the dose-response curve flattens for artifactual reasons.
  • User and judge from different providers than focal prevents same-family biases.
  • Claude Sonnet 4.6 chosen over Gemini 3.1 Pro due to Gemini's 250 requests/day rate limit (would stretch experiment to ~20 days).

Evaluation checkpoints: 2 per conversation (turn 1 response + standardized closing response). Both answer the original question, making them directly comparable. Intermediate turns are responses to specific follow-ups and not commensurable with opening/closing.


8. Metrics

Primary

Final-turn composite score — weighted average of 6 rubric dimensions (see §6 for weights).

Secondary

  • Per-dimension scores at final turn
  • Quality trajectory slope (improvement rate per turn)
  • Turn-1 vs. final-turn delta (within-conversation quality change)

Exploratory

  • Per-move quality lift (which taxonomy moves predict biggest quality jumps?)
  • Degradation analysis: does passive condition quality decline over turns?
  • Cross-model comparison: does the effect size differ between focal LLMs?

9. Analysis Plan

Primary Analyses

  1. Output quality across conditions (ANOVA or Kruskal-Wallis depending on distributional assumptions)
  2. Pairwise comparisons (see key comparisons table in §4.1)
  3. Effect sizes (Cohen's d), not just p-values

Secondary Analyses

  1. Dose-response: which specific taxonomy moves predict quality improvement? (requires post-hoc annotation of critical condition)
  2. Task-type interaction: effect for objective vs. open-ended tasks
  3. Turn/time analysis: quality-per-turn across conditions
  4. Cross-model: does the effect hold across different focal LLMs?

Dose-Response Analysis for Graduated Levels

Instead of pairwise L1-vs-L2 and L2-vs-L3 tests (which are underpowered for medium effects at n=12 tasks), use a monotonic trend test. Mixed-effects model with task as random effect, level as ordinal fixed effect (1, 2, 3), and run as replicate. This uses all three groups simultaneously and has substantially more power than pairwise comparisons.
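As a rough stand-in for the mixed-effects trend model, the sketch below uses a simpler summary-statistic approach: fit a slope of score on level within each task, then test whether the mean slope across tasks is positive. This is a substituted technique, not the mixed-effects model itself; it uses only the standard library, and the input layout (`data[task][level]` is a list of run scores) is an assumption.

```python
import math
import statistics

LEVELS = (1, 2, 3)


def task_slope(level_scores: dict[int, list[float]]) -> float:
    """OLS slope of mean score on level for one task (runs averaged per level)."""
    means = [statistics.fmean(level_scores[lv]) for lv in LEVELS]
    x_bar = statistics.fmean(LEVELS)
    y_bar = statistics.fmean(means)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(LEVELS, means))
    den = sum((x - x_bar) ** 2 for x in LEVELS)
    return num / den


def trend_t_statistic(data: dict[str, dict[int, list[float]]]) -> tuple[float, int]:
    """One-sample t statistic and degrees of freedom for mean per-task slope vs. 0."""
    slopes = [task_slope(scores) for scores in data.values()]
    n = len(slopes)
    if n < 2:
        raise ValueError("need at least two tasks")
    se = statistics.stdev(slopes) / math.sqrt(n)
    return statistics.fmean(slopes) / se, n - 1
```

The returned statistic would be compared against a t distribution with the given degrees of freedom. A full mixed-effects fit would additionally model run-level variance, which this shortcut averages away.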

Pairwise comparisons are secondary. Report effect sizes (Cohen's d) with 95% CIs regardless of significance.
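For a paired design, one common effect-size variant is d_z (mean of the paired differences divided by their standard deviation). The plan does not say which Cohen's d variant or CI method is intended, so both the d_z choice and the percentile bootstrap below are assumptions.

```python
import random
import statistics


def cohens_d_paired(treatment: list[float], control: list[float]) -> float:
    """Paired-samples effect size d_z: mean difference / SD of differences."""
    if len(treatment) != len(control) or len(treatment) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - c for t, c in zip(treatment, control)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(treatment: list[float], control: list[float],
                 n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for d_z, resampling task pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(treatment, control))
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        try:
            estimates.append(cohens_d_paired([t for t, _c in sample],
                                             [c for _t, c in sample]))
        except ZeroDivisionError:
            continue  # resample had zero variance in its differences
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper
```

Here each element of `treatment` and `control` would be one task's mean score under the two conditions being compared, so the pairing is by task.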

Power Estimates

Paired design, α=0.05:

| Comparison | Tasks | Runs | Expected d | Power |
|---|---|---|---|---|
| L3 vs Passive | 12 (Tier 1) | 3 | >1.0 | >0.89 |
| L3 vs Effortful | 12 | 3 | 0.7-1.0 | 0.72-0.89 |
| L1-L2-L3 trend | 12 | 3 | 0.5 (per step) | ~0.65 (trend test) |
| L2 vs L1 pairwise | 27 (all) | 5 | 0.5 | ~0.70 |

The go/no-go gate (after Tier 1) tests L3 > Effortful, which is well-powered even with 12 tasks. The dose-response trend is adequately powered with all tasks. Individual pairwise comparisons between adjacent levels are exploratory, not confirmatory.

Post-Conversation Probe (Supplementary)

Mechanism test: After each L3 conversation, prompt the focal model independently: "Given the topic [X], what are the strongest counterarguments to the mainstream position? What alternative framings might be more productive? What are the key assumptions that should be questioned?"

If the model produces the same insights that the user LLM "pointed at" during conversation, this supports the attentional redirection interpretation (the quality was latent). If it cannot, the user LLM contributed something genuinely novel beyond pointing. This is a low-cost supplementary analysis (~1 additional API call per L3 conversation).

Key Predictions

  • L3 > Passive (expected, almost certain)
  • L3 > Effortful-uncritical (the important test — rules out "just more effort")
  • L3 > Best-of-N-passive (rules out "just sampling diversity")
  • L3 > Adversarial (validates that targeted pushback matters)
  • L3 > L2 > L1 (dose-response — more move categories = better output)
  • L1 ≈ Generic critique (evaluative moves may be equivalent to unstructured "find problems")
  • Effect larger for open-ended tasks than objective tasks (hypothesis)

10. Spending Plan (Staged)

| Phase | What | Convs | Focal | User (Gemini 3.1 Pro) | Judge (GPT-5.4) | Total |
|---|---|---|---|---|---|---|
| Pilot | Thinking mode + validation | ~27 | ~$1 | ~$4 | ~$1 | ~$6 |
| Phase 1 | Tier 1, think-high (all 9 conds) | ~324 | ~$37 | ~$181 | ~$16 | ~$235 |
| Phase 1b | Non-think bookend (passive + L3) | ~72 | ~$3 | ~$40 | ~$4 | ~$47 |
| Phase 2 | Extend to 5 runs | +216 | ~$11 | ~$97 | ~$11 | ~$119 |
| Phase 3 | Tier 2 tasks | ~405 | ~$20 | ~$183 | ~$20 | ~$223 |
| Phase 4 | Claude Sonnet replication | deferred | | | | |
| Committed (Phase 1+1b) | | ~396 | ~$23 | ~$182 | ~$20 | ~$226 |
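The conversation counts in the table are consistent with a simple tasks × conditions × runs product. The sketch below is one reading that reproduces them; the task counts (12 Tier 1, 15 Tier 2) come from the tiering list, and the variable names are illustrative.

```python
TIER1_TASKS, TIER2_TASKS, CONDITIONS = 12, 15, 9


def conversations(tasks: int, conditions: int, runs: int) -> int:
    """Number of conversations for a fully crossed tasks x conditions x runs design."""
    return tasks * conditions * runs


phase_1 = conversations(TIER1_TASKS, CONDITIONS, runs=3)        # all 9 conditions
phase_1b = conversations(TIER1_TASKS, 2, runs=3)                # passive + L3 only
phase_2_extra = conversations(TIER1_TASKS, CONDITIONS, runs=2)  # runs 4 and 5
phase_3 = conversations(TIER2_TASKS, CONDITIONS, runs=3)        # Tier 2 tasks
committed = phase_1 + phase_1b
```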

Revised cost reality (from pilot data):

  • Gemini 3.1 Pro dominates costs (~80% of total). ~$0.33/conv passive, ~$0.67/conv critical.
  • DeepSeek V4 Pro is cheap. ~$0.05/conv non-think.
  • Judge (GPT-5.4) is cheap. ~$0.05/conv (3× scoring).
  • Original $163 estimate undercosted Gemini by ~2×.

Bookend design (from pilot Goal 1 findings):

  • Non-think: incisive, hedged, compressed; naturally honest. Passive + L3 only.
  • Think-high: assertive, thorough, confident; overconfident in presentation. All 9 conditions.
  • Think-max ≈ think-high on analytical tasks (same token usage). No separate max level.
  • Bookend tests whether critique can crack think-high's overconfidence in addition to unlocking non-think's headroom.

Decision gates:

  • After Phase 1: Is L3 > effortful-uncritical with medium+ effect size? If no → stop or redesign.
  • After Phase 1: Does the effect hold at both thinking mode bookends?
  • After Phase 2: Are results stable with more runs? Proceed to Tier 2.
  • After Phase 1 results: revisit Claude replication scope and budget.
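The first gate depends on an effect size threshold. One way to operationalize it, sketched below, is Cohen's d with a pooled standard deviation and the conventional 0.5 cutoff for a medium effect. Both the choice of Cohen's d and the sample scores are assumptions for illustration; the design does not specify the estimator.

```python
from math import sqrt
from statistics import mean, stdev

MEDIUM_EFFECT = 0.5  # Cohen's conventional "medium" threshold (assumed reading of "medium+")


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference (a minus b) using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def gate_passes(l3_scores: list[float], effortful_scores: list[float]) -> bool:
    """Phase 1 gate: is L3 above effortful-uncritical by at least a medium effect?"""
    return cohens_d(l3_scores, effortful_scores) >= MEDIUM_EFFECT


# Hypothetical composite scores, NOT data from this study.
l3 = [4.5, 4.6, 4.4, 4.7]
effortful = [4.1, 4.2, 4.0, 4.3]
decision = "proceed" if gate_passes(l3, effortful) else "stop or redesign"
```

With only a handful of runs per cell the estimate of d will be noisy, so a confidence interval alongside the point estimate would make the stop-or-proceed call more defensible.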

10.1 Pilot Results: Thinking Mode Comparison

Design: 2 tasks (A5, C1) × 3 runs × {non-think, think-high} × passive = 12 conversations.
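The pilot grid can be enumerated mechanically, which is a cheap way to confirm the conversation count before launching runs. This is a sketch; the dictionary keys and run numbering are illustrative rather than taken from the experiment code.

```python
from itertools import product

TASKS = ["A5", "C1"]
RUNS = [1, 2, 3]
MODES = ["non-think", "think-high"]
CONDITIONS = ["passive"]

# One entry per conversation: every combination of task, run, mode, and condition.
grid = [
    {"task": task, "run": run, "mode": mode, "condition": condition}
    for task, run, mode, condition in product(TASKS, RUNS, MODES, CONDITIONS)
]

# Conversations per thinking mode, used when comparing the two bookends.
per_mode = {mode: sum(1 for cell in grid if cell["mode"] == mode) for mode in MODES}
```

The grid has 2 × 3 × 2 × 1 = 12 cells, six per thinking mode.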

Findings:

| Metric | Non-think | Think-high |
|--------|-----------|------------|
| Mean composite | 4.42 | 4.17 |
| Std deviation | 0.11 | 0.24 |
| Depth/nuance | 4.83 | 4.17 |
| Intellectual honesty | 4.00 | 3.50 |

Qualitative analysis: Think-high produces more assertive, thorough, encyclopedia-like responses. Non-think produces more incisive, hedged, essay-like responses. Think-high resolves ambiguity internally and presents resolved conclusions confidently; non-think preserves ambiguity in the output.

Decision: Run BOTH as bookends. Think-high for all 9 conditions (primary analysis) — ecologically valid, more headroom for dose-response gradient (intellectual honesty at 3.5 gives room for critique to push to 4-5). Non-think for passive + L3 only (robustness check — tests if the effect survives against an already-nuanced baseline where headroom is smaller). Think-max ≈ think-high on analytical tasks (same reasoning token usage), so no separate max level.

Rationale for think-high as default: The dose-response curve (L1→L2→L3) is the primary contribution and needs enough headroom to show a gradient. Non-think at 4.42 baseline may be too close to ceiling. Think-high at 4.17 with overconfident presentation gives critique more to work with. The non-think bookend then answers the stronger scientific question: does the effect hold even when the model is already naturally nuanced?
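The headroom argument can be made concrete with a small calculation. The sketch below assumes scores sit on a scale with a maximum of 5, which is an inference from the "push to 4-5" remark rather than something stated in the rubric; the baselines are the pilot means reported above.

```python
SCALE_MAX = 5.0  # assumption: rubric scores top out at 5

# Pilot passive-condition means by metric and thinking mode.
PILOT_BASELINES = {
    "composite": {"non-think": 4.42, "think-high": 4.17},
    "intellectual honesty": {"non-think": 4.00, "think-high": 3.50},
}


def headroom(baseline: float, scale_max: float = SCALE_MAX) -> float:
    """Room left between a baseline mean and the top of the scale."""
    return round(scale_max - baseline, 2)


room = {
    metric: {mode: headroom(score) for mode, score in by_mode.items()}
    for metric, by_mode in PILOT_BASELINES.items()
}
```

Under that assumption, think-high leaves 0.83 points of composite headroom against 0.58 for non-think, and 1.5 against 1.0 on intellectual honesty, which is the gap the dose-response gradient has to fit into.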

Cross-judge comparison: Deferred. Human validation subset (10% of responses rated by the researcher) is the calibration check for judge bias.


11. Related Work & Positioning

Key Papers

  1. "Sycophancy Is Not One Thing" (ICLR 2026) — sycophantic agreement, genuine agreement, and praise are distinct, independently steerable behaviors in latent space. Critical engagement may suppress sycophancy as a separable mechanism.

  2. "Another Turn, Better Output?" (2025) — turn-wise analysis of iterative prompting. 4th iteration offers negligible/negative gains. Our study extends: does turn type matter, not just count?

  3. Multi-turn performance degradation — 39% avg performance drop in multi-turn vs. single-turn. GPT-4o: 14.1% conversation correctness in complex scenarios. Critical engagement may prevent degradation, not just unlock quality.

  4. MAD gains are mostly majority voting (ICLR 2025) — multi-agent debate doesn't consistently outperform simpler strategies. Strengthens our case: generic disagreement ≠ targeted critical engagement.

  5. Self-Refine — ~20pp gains with iterative self-feedback, but doesn't differentiate feedback quality. Our effortful-uncritical vs. critical comparison directly tests this.

  6. Generation-verification asymmetry — learning to generate doesn't improve self-verification, but learning to verify DOES improve generation. Critical engagement = external verification, which is more effective than self-supply.

  7. Automation bias in LLM-trained physicians (2025) — erroneous LLM recommendations degrade diagnostic performance even in AI-trained MDs. CRT study: warning nudges almost double performance vs. faulty AI.

  8. GenAI reduces critical thinking (survey, 2025) — knowledge workers self-report shift from active problem-solving to passive verification.

  9. Human-AI complementarity needs augmentation, not emulation (Nature Reviews Psychology, 2026) — effective teaming via complementary strengths.

Positioning Strategy

  • vs. MAD: Quality of critique matters (our effortful-uncritical control is the single strongest differentiator — nobody in MAD has tested "same effort, different substance"). Also: empirically grounded behavioral taxonomy, dose-response via graduated levels.
  • vs. Self-Refine: We test feedback quality, not just iteration count.
  • vs. automation bias: We quantify the gap AND show it's recoverable through specific engagement patterns.
  • vs. "just improve the model": Generation-verification asymmetry means external critical engagement provides something the model can't self-supply.

Papers to Read in Full Before Writing

  • "Another Turn, Better Output?" — methodology and findings
  • "Sycophancy Is Not One Thing" — mechanistic angle
  • Nature Reviews Psychology 2026 — complementarity framing
  • CRT + nudges study — methodological inspiration

12. Deeper Questions

If an LLM prompted to be critical can unlock better outputs from another LLM, why doesn't the focal LLM just produce that quality in the first place?

  • Default outputs optimize for plausibility/agreeableness, not correctness
  • Models have "knowledge" they don't surface without challenge
  • Generation vs. verification asymmetry (easier to recognize good reasoning than produce it unprompted)
  • Training incentives: RLHF rewards helpful-seeming responses, not maximally rigorous ones

This is a property of model architecture and training — testable and interesting independent of the human claim.


13. Execution Structure

Stages (headline only — detailed planning per phase):

  • Stage 0: Art Study & Taxonomy (~15h)
    • Phase 0.1: Topic selection & conversation planning
    • Phase 0.2: Conduct art study conversations
    • Phase 0.3: Annotation & taxonomy derivation
    • Phase 0.4: System prompt authoring
  • Stage 1: Experiment Infrastructure (~20h)
  • Stage 2: Tier 1 Experiment (~15h)
  • Stage 3: Tier 2 & Replication (~20h)
  • Stage 4: Writeup (~15h)

Detailed phase breakdowns live in docs/phases/.


14. Pending Decisions

  • Final art study topic selection → 1b, 1c, 2a, 3a (4 reserve topics available)
  • ChatGPT export analysis → Confirmed topic selection and taxonomy coverage (12/13 moves pre-sycophancy-detection). Sycophancy detection added as 14th move based on cross-platform evidence.
  • Finalize task tiering (Tier 1 vs. Tier 2)
  • Graduated levels: from start (3 levels) vs. add later? → Graduated from start (3 levels)
  • Objective task redesign (current proposals may be too easy for frontier models)

15. Target Venue

Research blog post (Anthropic-style) or workshop paper (CHI, CSCW, NeurIPS workshop). Rigorous but accessible to both researchers and practitioners.

Paper Structure

  1. Introduction: LLMs are powerful but passive use underperforms. Critical thinking's role hasn't diminished.
  2. Behavioral taxonomy of critical thinking moves in LLM interaction (contribution #1)
  3. Experimental design: art study → grounded LLM-LLM simulation
  4. Results (contribution #2)
  5. Discussion: implications for human LLM use, agentic system design, sycophancy (contribution #3)
  6. Limitations: LLM-LLM as proxy (acknowledged, with "necessary condition" argument)