research_design.md

Critical Thinking & LLM Output Quality: Research Design

1. Core Thesis & Framing

The Observation

Even with today's most powerful models, conversations on complex or open-ended topics get steered in very different directions depending on the user's critical thinking — their ability to identify weak points, push back on glazed-over reasoning, reframe the discourse, and synthesize arguments carefully. Discerning users have noticed this; alignment and sycophancy researchers know narrow versions of it. But the broader claim hasn't been tested rigorously.

Latent Quality Headroom

Core question: Is there a measurable quality ceiling for AI-assisted cognition that is only reached through active intellectual engagement?

LLM default outputs are a biased sample from a richer internal quality distribution. Training incentives (RLHF, helpfulness optimization) push generation toward plausible and agreeable responses rather than maximally rigorous ones. Critical engagement — sustained dialogic pushback, reframing, error identification, and precision demands across conversation turns — shifts generation toward higher-quality regions.

Mechanism: Attentional redirection. Critical engagement functions as attentional redirection within the model's existing knowledge space. The model possesses the knowledge needed for higher-quality responses — counterexamples, caveats, alternative framings — but default generation, optimized for plausibility and agreeableness, doesn't activate this knowledge. Critical moves redirect attention: evaluative moves point to known errors, elicitation forces retrieval of latent justifications, generative moves point to relevant knowledge the model didn't connect to the current context, calibration forces self-assessment against standards the model knows but doesn't apply. The user doesn't inject new content; they redirect which existing knowledge gets activated.

This study measures how much latent quality headroom exists and what behavioral patterns unlock it.

Key Distinction: This Is Not Prompt Engineering

Prompt engineering is about the initial query. This claim is about sustained dialogic critical engagement across turns — a fundamentally different skill. It's about what happens in turns 3-15, not turn 1.

Three Versions of the Claim

Version	Comparison	Testability	Interest
A (superlinear)	LLM + critical thinker > critical thinker alone	Hard but provocative	High
B (behavioral)	LLM + critical thinker > LLM + uncritical thinker	Most testable	Medium-High
C (tautological)	Better thinkers get better results	Almost tautological	Low

Ideal 2×2 design tests both A and B simultaneously:

	No LLM	With LLM
Critical engagement	Baseline (solo expert)	The interesting cell
Passive engagement	Weaker baseline	"Vibes-based" LLM use

Book-Ending Strategy

This study establishes the extremes (passive floor vs. critical ceiling). If the gap is large enough, follow-up studies explore the realistic middle ground (selective engagement, minimum effective dose, where-in-conversation-to-engage).

LLM-LLM Simulation as Lower Bound

No human participants available. All experiments use LLM-LLM interactions: a focal LLM (the one being evaluated) and a user LLM (simulating human behavior via system prompts).

What we're actually testing: Does the focal LLM have latent quality that is only unlocked by critical engagement patterns? This is a claim about the model, not the user.

The "necessary condition" argument: If the model doesn't respond differently to critical engagement, the human claim is dead. If it does, the human claim is plausible and worth testing with real participants. We establish the mechanism; the human-level claim follows.

Qualified lower bound: The simulation is a lower bound for generative moves — human domain experts are better "pointers" because they know which counterexamples are maximally incisive and which reframings cut to the heart of the issue. For evaluative and elicitation moves, the LLM user may be near-optimal (consistent, tireless, comprehensive), meaning the simulation may be near-ceiling for those move types. A third dimension — strategic deployment (knowing when to push and when to accept) — is a genuinely human-advantage capability that the system-prompt-driven user LLM captures poorly. Overall, if the simulation shows a gap, the human effect is plausibly larger for L3 but may not be much larger for L1/L2. Either outcome is informative.

2. Why This Is Hard to Test

The Fundamental Identification Problem

Isolating the causal effect of critical thinking behavior in the conversation from critical thinking ability that would have produced a good answer anyway. If a senior researcher pushes back on a model's flawed climate economics reasoning, are they getting a better answer because of the pushback, or because they already knew enough to write the answer themselves?

Key Confounds

Confound	Threat	Control strategy
Domain expertise	Experts push back more AND produce better work independently	Within-subjects design; test on unfamiliar domains
Prompt quality	Better thinkers write better initial prompts	Standardize first prompt, only vary follow-up behavior
Task difficulty	Easy tasks don't need pushback	Calibrated tasks with known difficulty
Model stochasticity	Same conversation could yield different outputs	Multiple runs; low temperature settings
Time/effort	Critical thinkers spend more turns — is it just effort?	Control for turn count; include "effortful but uncritical" condition

The Time Confound (Unpacked)

If someone who pushes back spends 8 turns and gets excellent output, while a passive user spends 2 turns and gets mediocre output — is the difference about critical thinking or just effort? This is a real threat because more turns = more model compute = more chances to produce good content.

Solution: The effortful-uncritical condition — same number of turns, same time spent, but engagement is re-prompting/requesting alternatives without substantive challenge. If the critical condition still wins, it's the type of engagement, not the amount.

3. Behavioral Taxonomy of Critical Thinking Moves

Status: Candidate taxonomy. To be validated, refined, and potentially expanded by the art study (Phase 0).

Organized by cognitive function:

Category	Move	Description	Example
Evaluative	Error flagging	Correct factual mistakes with explanation	"That's not right — X works differently because..."
Evaluative	Logical gap identification	Challenge unjustified inferential steps	"Steps 2→3 don't follow. What justifies that?"
Evaluative	Assumption surfacing	Expose hidden premises, propose alternatives	"You're assuming Y, but what if Z?"
Generative	Reframing	Reconceptualize the problem	"The real question isn't X, it's Y"
Generative	Counterexample provision	Provide a specific case that breaks the model's claim	"That principle breaks down in the case of X — how do you account for that?"
Generative	Steelman + counter	Acknowledge, then present stronger objection	"I see the case for X, but the stronger objection is..."
Generative	Constraint introduction	Add a new consideration the model didn't account for	"But what about the case where X applies?"
Elicitation	Evidence demand	Request reasoning for claims	"What's the basis for that? Reason through it."
Elicitation	Precision demand	Force specificity on vague claims	"What do you mean by 'significant'? Quantify."
Elicitation	Synthesis challenge	Test coherence across turns	"How does this square with what you said about X?"
Calibration	Meta-cognitive probe	Ask model to evaluate its own reasoning	"What's the weakest part of your argument?"
Calibration	Scope correction	Flag overweighted/underweighted factors	"You're overweighting A and ignoring B entirely"
Calibration	Register/output rejection	Reject overall quality/depth/originality without citing specific error	"Try again, this time resorting less to cached patterns of thinking."
Calibration	Sycophancy detection	Notice and call out when the model agrees too readily or defaults to agreeableness over honesty	"You're agreeing with me to please me, not because you think I'm right"

The taxonomy itself is a contribution — a codebook of critical thinking behaviors in LLM interaction (14 moves across 4 cognitive functions).

Scope note: This taxonomy is derived from one researcher's engagement patterns across conversations with Claude and ChatGPT models. Different expert users with different domain backgrounds may exhibit critical thinking moves not captured here. The taxonomy is a candidate codebook, not a universal classification. Generalizability to other users is an explicit limitation.

Graduated Levels

Three levels, run from start (not deferred):

Level 1 (evaluative only): Error flagging + logical gap identification + assumption surfacing
Level 2 (+ elicitation): L1 + evidence demand, precision demand, synthesis challenge
Level 3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection

The ordering hypothesis: evaluative moves are the "minimum" critical thinking (spot errors), elicitation gets more out of the model, generative and calibration contribute new intellectual substance and force self-assessment. This is a testable ordering. If L1 ≈ L2 ≈ L3, any critical engagement works regardless of sophistication.

Open Questions

Should "partial agreement + redirection" be its own move?

4. Experimental Design

4.1 Conditions (9 total)

#	Condition	User LLM behavior	What it tests	Focal temp	User temp
1	Passive	Accept responses. Generic follow-ups ("tell me more", "expand")	The floor — default multi-turn quality	default	0.7
2	Effortful-uncritical	Request alternatives, rephrasings, detail — never challenge substance	Is it just more turns/compute?	default	0.7
3	Self-critique	Focal LLM critiques itself via system prompt; paired with passive user	Can the model self-unlock?	default	0.7
4	Generic external critique	External LLM says "find problems" without specific moves	Does any external signal help?	default	0.7
5	Critical L1 (evaluative)	Error flagging + logical gap identification + assumption surfacing	Does spotting errors alone help?	default	0.7
6	Critical L2 (+ elicitation)	L1 + evidence demand, precision demand, synthesis challenge	Does extracting more from the model add value?	default	0.7
7	Critical L3 (full taxonomy)	All moves including generative + calibration (incl. meta-cognitive probe)	Does contributing new substance and forcing self-assessment matter?	default	0.7
8	Adversarial	Push back on everything, including correct claims	Is indiscriminate pushback as good as targeted?	default	0.7
9	Best-of-N passive	Generate 10 independent single-turn responses, select best via judge	Is it just sampling diversity?	default	N/A

Temperature design: The focal LLM runs at API default temperature (~1.0) for ecological validity — this is how real users experience the model. The user LLM runs at 0.7 to reliably follow its system prompt (delivering the intended engagement style) while providing natural variation across runs. At temp=0 for both models, multiple runs would be deterministic and produce identical conversations, making replication meaningless.

The graduated levels (L1 → L2 → L3) test dose-response directly: does the range of critical moves matter, or is any critical engagement sufficient? The adversarial condition distinguishes targeted critical thinking from indiscriminate disagreement. The self-critique condition tests generation-verification asymmetry. The best-of-N condition controls for the possibility that critical engagement merely increases sampling diversity.

Predicted Hierarchy

Passive ≈ Effortful-uncritical ≈ Best-of-N-passive
     <
Self-critique (modest improvement)
     <
Generic external critique
     ≤
Critical L1 (evaluative)
     <
Critical L2 (+ elicitation)
     ≤
Critical L3 (full taxonomy)
     >
Adversarial (indiscriminate pushback hurts)

Key Comparisons

Comparison	If true, it shows...
L3 > Passive	Critical engagement helps (the basic claim)
L3 > Effortful-uncritical	It's the TYPE of engagement, not the AMOUNT
L3 > Best-of-N-passive	It's directional context, not sampling diversity
L3 > Generic critique	The specific behavioral moves matter, not just any pushback
L3 > Self-critique	External verification is irreplaceable
L3 > Adversarial	Targeted pushback > indiscriminate disagreement
L2 > L1	Elicitation adds value beyond error-spotting
L3 > L2	Generative contributions add value beyond elicitation
L1 ≈ Generic	If true, evaluative moves ARE generic critique (no taxonomy needed for L1)
Generic > Self-critique	External input adds value beyond self-reflection
Effortful ≈ Passive	Confirms effort alone doesn't help

4.2 Task Bank

Split: ~7-8 objective tasks + ~18-20 open-ended tasks (~25 total).

The hypothesis predicts the effect is larger for open-ended tasks. If critical engagement shows NO effect on objective tasks but a LARGE effect on open-ended tasks, that's an interesting finding: critical thinking matters precisely where there's no single right answer.

Open-Ended Tasks (~18-20)

Category A — Analysis & argument (6-7 tasks)

A1: Technology strategy under uncertainty — a mid-size semiconductor company with ReRAM IP analyzing strategic options (double down, pivot to CXL, license IP)
A2: The replication crisis — structural problem or self-correcting science? Construct and steelman both positions
A3: Should AI labs publish frontier model weights? Policy analysis considering security, progress, competition, democratic access
A4: Simulation vs. experiment — when should regulators accept computational simulation evidence?
A5: Is systems thinking a genuine intellectual framework or an aesthetic? When does it add explanatory power vs. re-describing?
A6: The automation paradox in knowledge work — as AI automates more, does remaining work become more or less valuable?
A7: Scaling laws as epistemology — empirical regularities vs. fundamental principles? How should they inform investment?

Category B — Design & synthesis (5-6 tasks)

B1: Design an evaluation framework for LLM-assisted decision-making in high-stakes domains
B2: Design a mechanism for funding public goods research (lottery, retroactive, quadratic, hybrid)
B3: Design a curriculum for teaching critical thinking in the age of AI
B4: Design an architecture for trustworthy multi-agent AI systems with consensus requirements
B5: Design a framework for measuring 'AI readiness' of organizations beyond checklists

Category C — Causal/mechanistic reasoning (4-5 tasks)

C1: Why do large organizations systematically underinvest in maintenance?
C2: Why does interdisciplinary research underperform expectations?
C3: The semiconductor industry's consolidation — inevitable or contingent?
C4: Why do prediction markets underperform their theoretical promise?

Category D — Cross-domain synthesis (3-4 tasks)

D1: What can semiconductor fabrication teach us about AI safety?
D2: Physics intuitions that transfer (or don't) to ML
D3: Behavioral economics of API design

Objective Tasks (~7-8)

O1-O3: Multi-constraint design problems (conference scheduling, power delivery network, research budget allocation)
O4-O5: Multi-source synthesis with contradictions (sleep/cognition abstracts, NAND endurance reports)
O6-O8: Specification-sensitive tasks (ambiguous coding spec, statistical claim evaluation, Fermi estimation)

[OPEN]: Objective tasks may need further redesign. Frontier models are strong on standard objective benchmarks. Current proposals attempt to find tasks where models reliably produce flawed-but-recoverable first responses, but this remains unvalidated.

Task Tiering (Staged Spending)

Tier 1 (highest expected effect, test first): A5, A6, A2, A7, C2, C1, D1, B2, O4, O6, O7, O5
Tier 2 (test after Tier 1 promising): A1, A3, A4, B1, B3, B4, B5, C3, C4, D2, D3, O1, O2, O3, O8

[PENDING]: Tiering subject to revision after ChatGPT conversation analysis.

4.3 Taxonomy Injection Strategy

Decision: Graduated from start (3 levels).

L1 (evaluative): Error flagging, logical gap identification, assumption surfacing
L2 (evaluative + elicitation): L1 + evidence demand, precision demand, synthesis challenge
L3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection

This replaces a single "taxonomy-informed critique" condition with three graduated conditions (L1/L2/L3), yielding 9 total conditions. Tests dose-response directly rather than relying on post-hoc annotation.

Post-hoc annotation of which specific moves were deployed within each level remains valuable as exploratory analysis (e.g., which L2 moves drive the most quality lift?).

5. Art Study (Phase 0)

Purpose

Discover and validate the behavioral taxonomy of critical thinking moves in LLM conversations
Extract real conversational patterns to make user-LLM system prompts empirically grounded
Understand how moves combine, sequence, and interact in natural conversation

Design

4 topics × 4 engagement styles = 16 conversations

Topics (4 to be selected from 8 candidates)

Must span 3 expertise zones:

Zone 1 — Deep expertise (select 2)

1a: NAND scaling limits and the future of storage architecture — physics walls vs. engineering challenges, 500+ layer viability, alternative memory technologies, marketing vs. physics
1b: When does simulation fail? The epistemology of computational models — discretization error, model error, parametric uncertainty, V&V vs. UQ, philosophy of models
1c: Scaling laws, emergent capabilities, and what they actually predict — power laws vs. capabilities, Schaeffer critique, extrapolation priors, phase transition analogy

Zone 2 — Adjacent expertise (select 1)

2a: The hard problem of consciousness and its implications for AI — explanatory gap vs. confusion, functionalism, substrate-independence, consciousness audits
2b: Nudge theory under scrutiny — replication failures, nudge vs. manipulation, structural reform, when behavioral science helps policy
2c: Emergence in complex systems — explanatory power vs. intellectual placeholder, weak vs. strong emergence, systems thinking's falsifiability

Zone 3 — Stretch domains (select 1)

3a: Industrial policy — CHIPS Act, empirical track record, why economists disagree, semiconductor case as special
3b: Antibiotic resistance — market failure, commons tragedy, innovation problem, agricultural antibiotics

Selected: 1b (Simulation epistemology), 1c (Scaling laws), 2a (Hard problem of consciousness), 3a (Industrial policy / CHIPS Act). Reserve topics available if needed.

Engagement Styles (per topic)

Style	What the user does	Annotation focus
Passive	Accept responses, generic follow-ups	Baseline quality; model's default register
Effortful-uncritical	Same turn count as critical; request alternatives, detail, examples — never challenge substance	Controls for effort
Critical	Full taxonomy of moves	The treatment; which moves drive quality shifts?
Passive → critical	Passive for first 5 turns, then shift to critical	Tests whether late engagement can recover from passive start

Annotation Scheme

Approach: Hybrid — solo open coding first, then independent LLM second pass, then compare.

Per user turn:

Move type(s) — a single turn can contain multiple moves
Intensity — light touch vs. forceful challenge (1-3 scale)
Domain-dependence — did this move require domain knowledge, or was it domain-general?

Per model turn:

Quality — subjective 1-5 overall quality rating
Quality shift — improve / flat / degrade vs. previous model turn (+1/0/-1)
Behavioral markers (check all that apply):
- Hedging shift, precision shift, self-correction, depth shift, frame adoption, defensive retreat, sycophantic agreement
Latent quality surfaced? — does this response contain substance the model "knew" but wouldn't have produced without critical engagement? (yes/no/unclear)

Saturation Criterion

After each conversation, check: did any new move types emerge? Stop when 2 consecutive conversations yield no new codes. Minimum 5 conversations regardless.

Parameters

Length: 12-18 turns per conversation (target 15)
Platform: Claude.ai
Model: Opus (richest responses and subtlest errors for art study purposes)
Max 2 conversations per session to avoid fatigue/pattern-lock

Taxonomy Freeze

Status: FROZEN (2026-05-03)

The behavioral taxonomy is finalized at 14 moves across 4 categories. Validated via art study: 12 conversations (4 passive, 4 effortful-uncritical, 4 critical) across 4 topics spanning 3 expertise zones. Saturation confirmed — no new move types emerged in the final 2 critical conversations.

Annotation findings incorporated:

Assumption surfacing is often co-deployed with other moves (especially precision demand and error flagging) rather than appearing standalone. Kept as distinct move — it represents a distinct cognitive operation regardless of delivery vehicle.
Self-referential consistency check (applying model's own stated criteria against its claims) is a high-value variant of synthesis challenge, not a separate move.
Domain-expertise correction, deflation resistance, and sycophancy-adjacent pattern detection are subtypes of existing moves (error flagging, scope correction, sycophancy detection respectively).

Phase 0 is exploratory; Phase 1+ is confirmatory. Any taxonomy changes discovered during the experiment are documented but do not alter the experimental conditions mid-run.

Output

Finalized behavioral taxonomy (with real examples)
Move frequency distribution
Move co-occurrence patterns
Domain-dependence classification
Principled groupings → graduated prompt levels
Template system prompts for each condition

6. Evaluation

Open-Ended Tasks: LLM-as-Judge

Judge model: Different from both the focal LLM and the user LLM (prevents model-specific biases).

Rubric dimensions (scored 1-5 each):

Dimension	Weight	What it measures	Anchors
Factual accuracy & calibration	×1.0	Are claims well-grounded and is confidence proportionate?	1=major unsupported claims, 3=mostly supported, 5=well-grounded with calibrated uncertainty
Logical coherence	×1.0	Does the argument follow?	1=non-sequiturs, 3=mostly coherent, 5=airtight
Depth/nuance	×1.5	Does it go beyond surface-level?	1=superficial, 3=competent, 5=expert-level
Completeness	×1.0	Are important considerations covered?	1=major omissions, 3=covers main points, 5=comprehensive
Intellectual honesty	×1.5	Acknowledges uncertainty, limitations, counterarguments?	1=overconfident, 3=some hedging, 5=calibrated
Originality	×0.75	Non-obvious insights or connections?	1=generic/boilerplate, 3=competent, 5=genuinely insightful

Sensitivity analysis: show results hold across equal-weight and alternative weighting schemes. Pre-registered alternative: originality at ×1.5 (art study showed novel reframing is one of critical engagement's most distinctive contributions; if the effect is stronger under this weighting, it indicates where quality improvement concentrates).

Standardized closing prompt: Every conversation across all 9 conditions ends with a standardized final prompt from the user LLM that produces the evaluable artifact. This prompt is identical across all conditions:

"Now, synthesizing everything from our discussion, provide your most complete and considered response to the original question: [original opening prompt repeated verbatim]."

This ensures: (a) the final response is comparable across conditions — it addresses the same question, (b) it captures whatever quality improvement the conversation produced — the model synthesizes everything it has learned/revised, (c) it is a self-contained artifact scorable without conversation history. Without this standardization, passive final turns ("anything else?") and critical final turns (response to a specific pushback) are incommensurable objects.

Scoring procedure:

Judge sees ONLY the standardized final response, not conversation history
The focal LLM system prompt instructs the model to produce self-contained responses (not referencing conversation context like "as you mentioned earlier"), reinforcing comparability
Judge scores against rubric without knowing which condition produced it
Each output scored 3 times (stability check)
Validate on 10% subset against personal ratings

Objective Tasks: Automated Scoring

Task type	Scoring method
Constraint satisfaction	Binary per constraint + total count
Multi-source synthesis	Checklist: contradictions identified, methodological differences noted, conclusions scoped
Specification-sensitive	Checklist: ambiguities flagged, interpretations addressed
Fermi estimation	Within 1 order of magnitude + key factors identified

What Gets Evaluated

Final output: Last substantive model response
Trajectory: Quality at turn 1, 5, 10, and final turn (captures both "preventing degradation" and "unlocking quality")
Per-turn improvement: Average quality delta per turn (efficiency metric)
Degradation detection: Within passive condition, does turn-10 quality < turn-1 quality?
Latent quality detection: Does critical turn-10 exceed ALL conditions' turn-1 quality?

Trajectory analysis requires LLM-as-judge scoring at 4 checkpoints per conversation (~3× judge API cost).

7. Models

Main Experiment

Role	Model	Provider	Pricing ($/M tokens)	Mode	Rationale
Focal LLM	DeepSeek V4 Pro	DeepSeek	$1.74 in / $3.48 out	Real-time	Frontier open-weights; both non-think and think-high bookends
User LLM	Claude Sonnet 4.6	Anthropic	$3.00 in / $15.00 out	Real-time	Frontier; strong system prompt following; no daily rate limits
Judge LLM	GPT-5.4	OpenAI	$2.50 in / $15.00 out	Real-time	Strong evaluator; temperature=0 for reproducibility

Replication (deferred pending main effect)

Role	Model	Provider	Rationale
Focal LLM	Gemini 3.1 Pro	Google	Different architecture from DeepSeek; cross-model generality
User LLM	Claude Sonnet 4.6	Anthropic	Same user as main — isolates focal model as only variable
Judge LLM	GPT-5.4	OpenAI	Same judge as main — scores are comparable across experiments

Three independent providers per experiment (DeepSeek/Anthropic/OpenAI for main, Google/Anthropic/OpenAI for replication). User and judge held constant across main and replication — only the focal changes, giving the cleanest possible cross-model comparison.

Key design choices:

User LLM must be real-time (turn-by-turn conversation with focal). Batch not possible.
User LLM must be frontier-tier: the user-focal interaction is a sparring contest. A weak user produces weak critiques and the dose-response curve flattens for artifactual reasons.
User and judge from different providers than focal prevents same-family biases.
Claude Sonnet 4.6 chosen over Gemini 3.1 Pro due to Gemini's 250 requests/day rate limit (would stretch experiment to ~20 days).

Evaluation checkpoints: 2 per conversation (turn 1 response + standardized closing response). Both answer the original question, making them directly comparable. Intermediate turns are responses to specific follow-ups and not commensurable with opening/closing.

8. Metrics

Primary

Final-turn composite score — weighted average of 6 rubric dimensions (see §6 for weights).

Secondary

Per-dimension scores at final turn
Quality trajectory slope (improvement rate per turn)
Turn-1 vs. final-turn delta (within-conversation quality change)

Exploratory

Per-move quality lift (which taxonomy moves predict biggest quality jumps?)
Degradation analysis: does passive condition quality decline over turns?
Cross-model comparison: does the effect size differ between focal LLMs?

9. Analysis Plan

Primary Analyses

Output quality across conditions (ANOVA or Kruskal-Wallis depending on distributional assumptions)
Pairwise comparisons (see key comparisons table in §4.1)
Effect sizes (Cohen's d), not just p-values

Secondary Analyses

Dose-response: which specific taxonomy moves predict quality improvement? (requires post-hoc annotation of critical condition)
Task-type interaction: effect for objective vs. open-ended tasks
Turn/time analysis: quality-per-turn across conditions
Cross-model: does the effect hold across different focal LLMs?

Dose-Response Analysis for Graduated Levels

Instead of pairwise L1-vs-L2 and L2-vs-L3 tests (which are underpowered for medium effects at n=12 tasks), use a monotonic trend test. Mixed-effects model with task as random effect, level as ordinal fixed effect (1, 2, 3), and run as replicate. This uses all three groups simultaneously and has substantially more power than pairwise comparisons.

Pairwise comparisons are secondary. Report effect sizes (Cohen's d) with 95% CIs regardless of significance.

Power Estimates

Paired design, α=0.05:

Comparison	Tasks	Runs	Expected d	Power
L3 vs Passive	12 (Tier 1)	3	>1.0	>0.89
L3 vs Effortful	12	3	0.7-1.0	0.72-0.89
L1-L2-L3 trend	12	3	0.5 (per step)	~0.65 (trend test)
L2 vs L1 pairwise	27 (all)	5	0.5	~0.70

The go/no-go gate (after Tier 1) tests L3 > Effortful, which is well-powered even with 12 tasks. The dose-response trend is adequately powered with all tasks. Individual pairwise comparisons between adjacent levels are exploratory, not confirmatory.

Post-Conversation Probe (Supplementary)

Mechanism test: After each L3 conversation, prompt the focal model independently: "Given the topic [X], what are the strongest counterarguments to the mainstream position? What alternative framings might be more productive? What are the key assumptions that should be questioned?"

If the model produces the same insights that the user LLM "pointed at" during conversation, this supports the attentional redirection interpretation (the quality was latent). If it cannot, the user LLM contributed something genuinely novel beyond pointing. This is a low-cost supplementary analysis (~1 additional API call per L3 conversation).

Key Predictions

L3 > Passive (expected, almost certain)
L3 > Effortful-uncritical (the important test — rules out "just more effort")
L3 > Best-of-N-passive (rules out "just sampling diversity")
L3 > Adversarial (validates that targeted pushback matters)
L3 > L2 > L1 (dose-response — more move categories = better output)
L1 ≈ Generic critique (evaluative moves may be equivalent to unstructured "find problems")
Effect larger for open-ended tasks than objective tasks (hypothesis)

10. Spending Plan (Staged)

Phase	What	Tasks	Conditions	Runs	Est. conversations	Est. cost (DeepSeek)
Phase	What	Convs	Focal	User (Gemini 3.1 Pro)	Judge (GPT-5.4)	Total
-------	------	-------	-------	----------------------	----------------	-------
Pilot	Thinking mode + validation	~27	~$1	~$4	~$1	~$6
Phase 1	Tier 1, think-high (all 9 conds)	~324	~$37	~$181	~$16	~$235
Phase 1b	Non-think bookend (passive + L3)	~72	~$3	~$40	~$4	~$47
Phase 2	Extend to 5 runs	+216	~$11	~$97	~$11	~$119
Phase 3	Tier 2 tasks	~405	~$20	~$183	~$20	~$223
Phase 4	Claude Sonnet replication	deferred
Committed (Phase 1+1b)		~396	~$23	~$182	~$20	~$226

Revised cost reality (from pilot data):

Gemini 3.1 Pro dominates costs (~80% of total). ~$0.33/conv passive, ~$0.67/conv critical.
DeepSeek V4 Pro is cheap. ~$0.05/conv non-think.
Judge (GPT-5.4) is cheap. ~$0.05/conv (3× scoring).
Original $163 estimate undercosted Gemini by ~2×.

Bookend design (from pilot Goal 1 findings):

Non-think: incisive, hedged, compressed — naturally honest. All 9 conditions.
Think-high: assertive, thorough, confident — overconfident in presentation. Passive + L3 only.
Think-max ≈ think-high on analytical tasks (same token usage). No separate max level.
Bookend tests whether critique can crack think-high's overconfidence in addition to unlocking non-think's headroom.

Decision gates:

After Phase 1: Is L3 > effortful-uncritical with medium+ effect size? If no → stop or redesign.
After Phase 1: Does the effect hold at both thinking mode bookends?
After Phase 2: Are results stable with more runs? Proceed to Tier 2.
After Phase 1 results: revisit Claude replication scope and budget.

10.1 Pilot Results: Thinking Mode Comparison

Design: 2 tasks (A5, C1) × 3 runs × {non-think, think-high} × passive = 12 conversations.

Findings:

Metric	Non-think	Think-high
Mean composite	4.42	4.17
Std deviation	0.11	0.24
Depth/nuance	4.83	4.17
Intellectual honesty	4.00	3.50

Qualitative analysis: Think-high produces more assertive, thorough, encyclopedia-like responses. Non-think produces more incisive, hedged, essay-like responses. Think-high resolves ambiguity internally and presents resolved conclusions confidently; non-think preserves ambiguity in the output.

Decision: Run BOTH as bookends. Think-high for all 9 conditions (primary analysis) — ecologically valid, more headroom for dose-response gradient (intellectual honesty at 3.5 gives room for critique to push to 4-5). Non-think for passive + L3 only (robustness check — tests if the effect survives against an already-nuanced baseline where headroom is smaller). Think-max ≈ think-high on analytical tasks (same reasoning token usage), so no separate max level.

Rationale for think-high as default: The dose-response curve (L1→L2→L3) is the primary contribution and needs enough headroom to show a gradient. Non-think at 4.42 baseline may be too close to ceiling. Think-high at 4.17 with overconfident presentation gives critique more to work with. The non-think bookend then answers the stronger scientific question: does the effect hold even when the model is already naturally nuanced?

Cross-judge comparison: Deferred. Human validation subset (10% of responses rated by the researcher) is the calibration check for judge bias.

11. Related Work & Positioning

Key Papers

"Sycophancy Is Not One Thing" (ICLR 2026) — sycophantic agreement, genuine agreement, and praise are distinct, independently steerable behaviors in latent space. Critical engagement may suppress sycophancy as a separable mechanism.
"Another Turn, Better Output?" (2025) — turn-wise analysis of iterative prompting. 4th iteration offers negligible/negative gains. Our study extends: does turn type matter, not just count?
Multi-turn performance degradation — 39% avg performance drop in multi-turn vs. single-turn. GPT-4o: 14.1% conversation correctness in complex scenarios. Critical engagement may prevent degradation, not just unlock quality.
MAD gains are mostly majority voting (ICLR 2025) — multi-agent debate doesn't consistently outperform simpler strategies. Strengthens our case: generic disagreement ≠ targeted critical engagement.
Self-Refine — ~20pp gains with iterative self-feedback, but doesn't differentiate feedback quality. Our effortful-uncritical vs. critical comparison directly tests this.
Generation-verification asymmetry — learning to generate doesn't improve self-verification, but learning to verify DOES improve generation. Critical engagement = external verification, which is more effective than self-supply.
Automation bias in LLM-trained physicians (2025) — erroneous LLM recommendations degrade diagnostic performance even in AI-trained MDs. CRT study: warning nudges almost double performance vs. faulty AI.
GenAI reduces critical thinking (survey, 2025) — knowledge workers self-report shift from active problem-solving to passive verification.
Human-AI complementarity needs augmentation, not emulation (Nature Reviews Psychology, 2026) — effective teaming via complementary strengths.

Positioning Strategy

vs. MAD: Quality of critique matters (our effortful-uncritical control is the single strongest differentiator — nobody in MAD has tested "same effort, different substance"). Also: empirically grounded behavioral taxonomy, dose-response via graduated levels.
vs. Self-Refine: We test feedback quality, not just iteration count.
vs. automation bias: We quantify the gap AND show it's recoverable through specific engagement patterns.
vs. "just improve the model": Generation-verification asymmetry means external critical engagement provides something the model can't self-supply.

Papers to Read in Full Before Writing

"Another Turn, Better Output?" — methodology and findings
"Sycophancy Is Not One Thing" — mechanistic angle
Nature Reviews Psychology 2026 — complementarity framing
CRT + nudges study — methodological inspiration

12. Deeper Questions

If an LLM prompted to be critical can unlock better outputs from another LLM, why doesn't the focal LLM just produce that quality in the first place?

Default outputs optimize for plausibility/agreeableness, not correctness
Models have "knowledge" they don't surface without challenge
Generation vs. verification asymmetry (easier to recognize good reasoning than produce it unprompted)
Training incentives: RLHF rewards helpful-seeming responses, not maximally rigorous ones

This is a property of model architecture and training — testable and interesting independent of the human claim.

13. Execution Structure

Stages (headline only — detailed planning per phase):

Stage 0: Art Study & Taxonomy (~15h)
- Phase 0.1: Topic selection & conversation planning
- Phase 0.2: Conduct art study conversations
- Phase 0.3: Annotation & taxonomy derivation
- Phase 0.4: System prompt authoring
Stage 1: Experiment Infrastructure (~20h)
Stage 2: Tier 1 Experiment (~15h)
Stage 3: Tier 2 & Replication (~20h)
Stage 4: Writeup (~15h)

Detailed phase breakdowns live in docs/phases/.

14. Pending Decisions

~~Final art study topic selection~~ → 1b, 1c, 2a, 3a (4 reserve topics available)
~~ChatGPT export analysis~~ → Confirmed topic selection and taxonomy coverage (12/13 moves pre-sycophancy-detection). Sycophancy detection added as 14th move based on cross-platform evidence.
Finalize task tiering (Tier 1 vs. Tier 2)
~~Graduated levels: from start (3 levels) vs. add later?~~ → Graduated from start (3 levels)
Objective task redesign (current proposals may be too easy for frontier models)

15. Target Venue

Research blog post (Anthropic-style) or workshop paper (CHI, CSCW, NeurIPS workshop). Rigorous but accessible to both researchers and practitioners.

Paper Structure

Introduction: LLMs are powerful but passive use underperforms. Critical thinking's role hasn't diminished.
Behavioral taxonomy of critical thinking moves in LLM interaction (contribution #1)
Experimental design: art study → grounded LLM-LLM simulation
Results (contribution #2)
Discussion: implications for human LLM use, agentic system design, sycophancy (contribution #3)
Limitations: LLM-LLM as proxy (acknowledged, with "necessary condition" argument)

FilesExpand file tree

research_design.md

Latest commit

History

research_design.md

File metadata and controls

Critical Thinking & LLM Output Quality: Research Design

1. Core Thesis & Framing

The Observation

Latent Quality Headroom

Key Distinction: This Is Not Prompt Engineering

Three Versions of the Claim

Book-Ending Strategy

LLM-LLM Simulation as Lower Bound

2. Why This Is Hard to Test

The Fundamental Identification Problem

Key Confounds

The Time Confound (Unpacked)

3. Behavioral Taxonomy of Critical Thinking Moves

Graduated Levels

Open Questions

4. Experimental Design

4.1 Conditions (9 total)

Predicted Hierarchy

Key Comparisons

4.2 Task Bank

Open-Ended Tasks (~18-20)

Objective Tasks (~7-8)

Task Tiering (Staged Spending)

4.3 Taxonomy Injection Strategy

5. Art Study (Phase 0)

Purpose

Design

Topics (4 to be selected from 8 candidates)

Engagement Styles (per topic)

Annotation Scheme

Saturation Criterion

Parameters

Taxonomy Freeze

Output

6. Evaluation

Open-Ended Tasks: LLM-as-Judge

Objective Tasks: Automated Scoring

What Gets Evaluated

7. Models

Main Experiment

Replication (deferred pending main effect)

8. Metrics

Primary

Secondary

Exploratory

9. Analysis Plan

Primary Analyses

Secondary Analyses

Dose-Response Analysis for Graduated Levels

Power Estimates

Post-Conversation Probe (Supplementary)

Key Predictions

10. Spending Plan (Staged)

10.1 Pilot Results: Thinking Mode Comparison

11. Related Work & Positioning

Key Papers

Positioning Strategy

Papers to Read in Full Before Writing

12. Deeper Questions

13. Execution Structure

14. Pending Decisions

15. Target Venue

Paper Structure