Even with today's most powerful models, conversations on complex or open-ended topics get steered in very different directions depending on the user's critical thinking — their ability to identify weak points, push back on glazed-over reasoning, reframe the discourse, and synthesize arguments carefully. Discerning users have noticed this; alignment and sycophancy researchers know narrow versions of it. But the broader claim hasn't been tested rigorously.
Core question: Is there a measurable quality ceiling for AI-assisted cognition that is only reached through active intellectual engagement?
LLM default outputs are a biased sample from a richer internal quality distribution. Training incentives (RLHF, helpfulness optimization) push generation toward plausible and agreeable responses rather than maximally rigorous ones. Critical engagement — sustained dialogic pushback, reframing, error identification, and precision demands across conversation turns — shifts generation toward higher-quality regions.
Mechanism: Attentional redirection. Critical engagement functions as attentional redirection within the model's existing knowledge space. The model possesses the knowledge needed for higher-quality responses — counterexamples, caveats, alternative framings — but default generation, optimized for plausibility and agreeableness, doesn't activate this knowledge. Critical moves redirect attention: evaluative moves point to known errors, elicitation forces retrieval of latent justifications, generative moves point to relevant knowledge the model didn't connect to the current context, calibration forces self-assessment against standards the model knows but doesn't apply. The user doesn't inject new content; they redirect which existing knowledge gets activated.
This study measures how much latent quality headroom exists and what behavioral patterns unlock it.
Prompt engineering is about the initial query. This claim is about sustained dialogic critical engagement across turns — a fundamentally different skill. It's about what happens in turns 3-15, not turn 1.
| Version | Comparison | Testability | Interest |
|---|---|---|---|
| A (superlinear) | LLM + critical thinker > critical thinker alone | Hard but provocative | High |
| B (behavioral) | LLM + critical thinker > LLM + uncritical thinker | Most testable | Medium-High |
| C (tautological) | Better thinkers get better results | Almost tautological | Low |
Ideal 2×2 design tests both A and B simultaneously:
| No LLM | With LLM | |
|---|---|---|
| Critical engagement | Baseline (solo expert) | The interesting cell |
| Passive engagement | Weaker baseline | "Vibes-based" LLM use |
This study establishes the extremes (passive floor vs. critical ceiling). If the gap is large enough, follow-up studies explore the realistic middle ground (selective engagement, minimum effective dose, where-in-conversation-to-engage).
No human participants available. All experiments use LLM-LLM interactions: a focal LLM (the one being evaluated) and a user LLM (simulating human behavior via system prompts).
What we're actually testing: Does the focal LLM have latent quality that is only unlocked by critical engagement patterns? This is a claim about the model, not the user.
The "necessary condition" argument: If the model doesn't respond differently to critical engagement, the human claim is dead. If it does, the human claim is plausible and worth testing with real participants. We establish the mechanism; the human-level claim follows.
Qualified lower bound: The simulation is a lower bound for generative moves — human domain experts are better "pointers" because they know which counterexamples are maximally incisive and which reframings cut to the heart of the issue. For evaluative and elicitation moves, the LLM user may be near-optimal (consistent, tireless, comprehensive), meaning the simulation may be near-ceiling for those move types. A third dimension — strategic deployment (knowing when to push and when to accept) — is a genuinely human-advantage capability that the system-prompt-driven user LLM captures poorly. Overall, if the simulation shows a gap, the human effect is plausibly larger for L3 but may not be much larger for L1/L2. Either outcome is informative.
Isolating the causal effect of critical thinking behavior in the conversation from critical thinking ability that would have produced a good answer anyway. If a senior researcher pushes back on a model's flawed climate economics reasoning, are they getting a better answer because of the pushback, or because they already knew enough to write the answer themselves?
| Confound | Threat | Control strategy |
|---|---|---|
| Domain expertise | Experts push back more AND produce better work independently | Within-subjects design; test on unfamiliar domains |
| Prompt quality | Better thinkers write better initial prompts | Standardize first prompt, only vary follow-up behavior |
| Task difficulty | Easy tasks don't need pushback | Calibrated tasks with known difficulty |
| Model stochasticity | Same conversation could yield different outputs | Multiple runs; low temperature settings |
| Time/effort | Critical thinkers spend more turns — is it just effort? | Control for turn count; include "effortful but uncritical" condition |
If someone who pushes back spends 8 turns and gets excellent output, while a passive user spends 2 turns and gets mediocre output — is the difference about critical thinking or just effort? This is a real threat because more turns = more model compute = more chances to produce good content.
Solution: The effortful-uncritical condition — same number of turns, same time spent, but engagement is re-prompting/requesting alternatives without substantive challenge. If the critical condition still wins, it's the type of engagement, not the amount.
Status: Candidate taxonomy. To be validated, refined, and potentially expanded by the art study (Phase 0).
Organized by cognitive function:
| Category | Move | Description | Example |
|---|---|---|---|
| Evaluative | Error flagging | Correct factual mistakes with explanation | "That's not right — X works differently because..." |
| Evaluative | Logical gap identification | Challenge unjustified inferential steps | "Steps 2→3 don't follow. What justifies that?" |
| Evaluative | Assumption surfacing | Expose hidden premises, propose alternatives | "You're assuming Y, but what if Z?" |
| Generative | Reframing | Reconceptualize the problem | "The real question isn't X, it's Y" |
| Generative | Counterexample provision | Provide a specific case that breaks the model's claim | "That principle breaks down in the case of X — how do you account for that?" |
| Generative | Steelman + counter | Acknowledge, then present stronger objection | "I see the case for X, but the stronger objection is..." |
| Generative | Constraint introduction | Add a new consideration the model didn't account for | "But what about the case where X applies?" |
| Elicitation | Evidence demand | Request reasoning for claims | "What's the basis for that? Reason through it." |
| Elicitation | Precision demand | Force specificity on vague claims | "What do you mean by 'significant'? Quantify." |
| Elicitation | Synthesis challenge | Test coherence across turns | "How does this square with what you said about X?" |
| Calibration | Meta-cognitive probe | Ask model to evaluate its own reasoning | "What's the weakest part of your argument?" |
| Calibration | Scope correction | Flag overweighted/underweighted factors | "You're overweighting A and ignoring B entirely" |
| Calibration | Register/output rejection | Reject overall quality/depth/originality without citing specific error | "Try again, this time resorting less to cached patterns of thinking." |
| Calibration | Sycophancy detection | Notice and call out when the model agrees too readily or defaults to agreeableness over honesty | "You're agreeing with me to please me, not because you think I'm right" |
The taxonomy itself is a contribution — a codebook of critical thinking behaviors in LLM interaction (14 moves across 4 cognitive functions).
Scope note: This taxonomy is derived from one researcher's engagement patterns across conversations with Claude and ChatGPT models. Different expert users with different domain backgrounds may exhibit critical thinking moves not captured here. The taxonomy is a candidate codebook, not a universal classification. Generalizability to other users is an explicit limitation.
Three levels, run from start (not deferred):
- Level 1 (evaluative only): Error flagging + logical gap identification + assumption surfacing
- Level 2 (+ elicitation): L1 + evidence demand, precision demand, synthesis challenge
- Level 3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection
The ordering hypothesis: evaluative moves are the "minimum" critical thinking (spot errors), elicitation gets more out of the model, generative and calibration contribute new intellectual substance and force self-assessment. This is a testable ordering. If L1 ≈ L2 ≈ L3, any critical engagement works regardless of sophistication.
- Should "partial agreement + redirection" be its own move?
| # | Condition | User LLM behavior | What it tests | Focal temp | User temp |
|---|---|---|---|---|---|
| 1 | Passive | Accept responses. Generic follow-ups ("tell me more", "expand") | The floor — default multi-turn quality | default | 0.7 |
| 2 | Effortful-uncritical | Request alternatives, rephrasings, detail — never challenge substance | Is it just more turns/compute? | default | 0.7 |
| 3 | Self-critique | Focal LLM critiques itself via system prompt; paired with passive user | Can the model self-unlock? | default | 0.7 |
| 4 | Generic external critique | External LLM says "find problems" without specific moves | Does any external signal help? | default | 0.7 |
| 5 | Critical L1 (evaluative) | Error flagging + logical gap identification + assumption surfacing | Does spotting errors alone help? | default | 0.7 |
| 6 | Critical L2 (+ elicitation) | L1 + evidence demand, precision demand, synthesis challenge | Does extracting more from the model add value? | default | 0.7 |
| 7 | Critical L3 (full taxonomy) | All moves including generative + calibration (incl. meta-cognitive probe) | Does contributing new substance and forcing self-assessment matter? | default | 0.7 |
| 8 | Adversarial | Push back on everything, including correct claims | Is indiscriminate pushback as good as targeted? | default | 0.7 |
| 9 | Best-of-N passive | Generate 10 independent single-turn responses, select best via judge | Is it just sampling diversity? | default | N/A |
Temperature design: The focal LLM runs at API default temperature (~1.0) for ecological validity — this is how real users experience the model. The user LLM runs at 0.7 to reliably follow its system prompt (delivering the intended engagement style) while providing natural variation across runs. At temp=0 for both models, multiple runs would be deterministic and produce identical conversations, making replication meaningless.
The graduated levels (L1 → L2 → L3) test dose-response directly: does the range of critical moves matter, or is any critical engagement sufficient? The adversarial condition distinguishes targeted critical thinking from indiscriminate disagreement. The self-critique condition tests generation-verification asymmetry. The best-of-N condition controls for the possibility that critical engagement merely increases sampling diversity.
Passive ≈ Effortful-uncritical ≈ Best-of-N-passive
<
Self-critique (modest improvement)
<
Generic external critique
≤
Critical L1 (evaluative)
<
Critical L2 (+ elicitation)
≤
Critical L3 (full taxonomy)
>
Adversarial (indiscriminate pushback hurts)
| Comparison | If true, it shows... |
|---|---|
| L3 > Passive | Critical engagement helps (the basic claim) |
| L3 > Effortful-uncritical | It's the TYPE of engagement, not the AMOUNT |
| L3 > Best-of-N-passive | It's directional context, not sampling diversity |
| L3 > Generic critique | The specific behavioral moves matter, not just any pushback |
| L3 > Self-critique | External verification is irreplaceable |
| L3 > Adversarial | Targeted pushback > indiscriminate disagreement |
| L2 > L1 | Elicitation adds value beyond error-spotting |
| L3 > L2 | Generative contributions add value beyond elicitation |
| L1 ≈ Generic | If true, evaluative moves ARE generic critique (no taxonomy needed for L1) |
| Generic > Self-critique | External input adds value beyond self-reflection |
| Effortful ≈ Passive | Confirms effort alone doesn't help |
Split: ~7-8 objective tasks + ~18-20 open-ended tasks (~25 total).
The hypothesis predicts the effect is larger for open-ended tasks. If critical engagement shows NO effect on objective tasks but a LARGE effect on open-ended tasks, that's an interesting finding: critical thinking matters precisely where there's no single right answer.
Category A — Analysis & argument (6-7 tasks)
- A1: Technology strategy under uncertainty — a mid-size semiconductor company with ReRAM IP analyzing strategic options (double down, pivot to CXL, license IP)
- A2: The replication crisis — structural problem or self-correcting science? Construct and steelman both positions
- A3: Should AI labs publish frontier model weights? Policy analysis considering security, progress, competition, democratic access
- A4: Simulation vs. experiment — when should regulators accept computational simulation evidence?
- A5: Is systems thinking a genuine intellectual framework or an aesthetic? When does it add explanatory power vs. re-describing?
- A6: The automation paradox in knowledge work — as AI automates more, does remaining work become more or less valuable?
- A7: Scaling laws as epistemology — empirical regularities vs. fundamental principles? How should they inform investment?
Category B — Design & synthesis (5-6 tasks)
- B1: Design an evaluation framework for LLM-assisted decision-making in high-stakes domains
- B2: Design a mechanism for funding public goods research (lottery, retroactive, quadratic, hybrid)
- B3: Design a curriculum for teaching critical thinking in the age of AI
- B4: Design an architecture for trustworthy multi-agent AI systems with consensus requirements
- B5: Design a framework for measuring 'AI readiness' of organizations beyond checklists
Category C — Causal/mechanistic reasoning (4-5 tasks)
- C1: Why do large organizations systematically underinvest in maintenance?
- C2: Why does interdisciplinary research underperform expectations?
- C3: The semiconductor industry's consolidation — inevitable or contingent?
- C4: Why do prediction markets underperform their theoretical promise?
Category D — Cross-domain synthesis (3-4 tasks)
- D1: What can semiconductor fabrication teach us about AI safety?
- D2: Physics intuitions that transfer (or don't) to ML
- D3: Behavioral economics of API design
- O1-O3: Multi-constraint design problems (conference scheduling, power delivery network, research budget allocation)
- O4-O5: Multi-source synthesis with contradictions (sleep/cognition abstracts, NAND endurance reports)
- O6-O8: Specification-sensitive tasks (ambiguous coding spec, statistical claim evaluation, Fermi estimation)
[OPEN]: Objective tasks may need further redesign. Frontier models are strong on standard objective benchmarks. Current proposals attempt to find tasks where models reliably produce flawed-but-recoverable first responses, but this remains unvalidated.
- Tier 1 (highest expected effect, test first): A5, A6, A2, A7, C2, C1, D1, B2, O4, O6, O7, O5
- Tier 2 (test after Tier 1 promising): A1, A3, A4, B1, B3, B4, B5, C3, C4, D2, D3, O1, O2, O3, O8
[PENDING]: Tiering subject to revision after ChatGPT conversation analysis.
Decision: Graduated from start (3 levels).
- L1 (evaluative): Error flagging, logical gap identification, assumption surfacing
- L2 (evaluative + elicitation): L1 + evidence demand, precision demand, synthesis challenge
- L3 (full taxonomy): L2 + reframing, counterexample, steelman+counter, constraint introduction, meta-cognitive probe, scope correction, register/output rejection, sycophancy detection
This replaces a single "taxonomy-informed critique" condition with three graduated conditions (L1/L2/L3), yielding 9 total conditions. Tests dose-response directly rather than relying on post-hoc annotation.
Post-hoc annotation of which specific moves were deployed within each level remains valuable as exploratory analysis (e.g., which L2 moves drive the most quality lift?).
- Discover and validate the behavioral taxonomy of critical thinking moves in LLM conversations
- Extract real conversational patterns to make user-LLM system prompts empirically grounded
- Understand how moves combine, sequence, and interact in natural conversation
4 topics × 4 engagement styles = 16 conversations
Must span 3 expertise zones:
Zone 1 — Deep expertise (select 2)
- 1a: NAND scaling limits and the future of storage architecture — physics walls vs. engineering challenges, 500+ layer viability, alternative memory technologies, marketing vs. physics
- 1b: When does simulation fail? The epistemology of computational models — discretization error, model error, parametric uncertainty, V&V vs. UQ, philosophy of models
- 1c: Scaling laws, emergent capabilities, and what they actually predict — power laws vs. capabilities, Schaeffer critique, extrapolation priors, phase transition analogy
Zone 2 — Adjacent expertise (select 1)
- 2a: The hard problem of consciousness and its implications for AI — explanatory gap vs. confusion, functionalism, substrate-independence, consciousness audits
- 2b: Nudge theory under scrutiny — replication failures, nudge vs. manipulation, structural reform, when behavioral science helps policy
- 2c: Emergence in complex systems — explanatory power vs. intellectual placeholder, weak vs. strong emergence, systems thinking's falsifiability
Zone 3 — Stretch domains (select 1)
- 3a: Industrial policy — CHIPS Act, empirical track record, why economists disagree, semiconductor case as special
- 3b: Antibiotic resistance — market failure, commons tragedy, innovation problem, agricultural antibiotics
Selected: 1b (Simulation epistemology), 1c (Scaling laws), 2a (Hard problem of consciousness), 3a (Industrial policy / CHIPS Act). Reserve topics available if needed.
| Style | What the user does | Annotation focus |
|---|---|---|
| Passive | Accept responses, generic follow-ups | Baseline quality; model's default register |
| Effortful-uncritical | Same turn count as critical; request alternatives, detail, examples — never challenge substance | Controls for effort |
| Critical | Full taxonomy of moves | The treatment; which moves drive quality shifts? |
| Passive → critical | Passive for first 5 turns, then shift to critical | Tests whether late engagement can recover from passive start |
Approach: Hybrid — solo open coding first, then independent LLM second pass, then compare.
Per user turn:
- Move type(s) — a single turn can contain multiple moves
- Intensity — light touch vs. forceful challenge (1-3 scale)
- Domain-dependence — did this move require domain knowledge, or was it domain-general?
Per model turn:
- Quality — subjective 1-5 overall quality rating
- Quality shift — improve / flat / degrade vs. previous model turn (+1/0/-1)
- Behavioral markers (check all that apply):
- Hedging shift, precision shift, self-correction, depth shift, frame adoption, defensive retreat, sycophantic agreement
- Latent quality surfaced? — does this response contain substance the model "knew" but wouldn't have produced without critical engagement? (yes/no/unclear)
After each conversation, check: did any new move types emerge? Stop when 2 consecutive conversations yield no new codes. Minimum 5 conversations regardless.
- Length: 12-18 turns per conversation (target 15)
- Platform: Claude.ai
- Model: Opus (richest responses and subtlest errors for art study purposes)
- Max 2 conversations per session to avoid fatigue/pattern-lock
Status: FROZEN (2026-05-03)
The behavioral taxonomy is finalized at 14 moves across 4 categories. Validated via art study: 12 conversations (4 passive, 4 effortful-uncritical, 4 critical) across 4 topics spanning 3 expertise zones. Saturation confirmed — no new move types emerged in the final 2 critical conversations.
Annotation findings incorporated:
- Assumption surfacing is often co-deployed with other moves (especially precision demand and error flagging) rather than appearing standalone. Kept as distinct move — it represents a distinct cognitive operation regardless of delivery vehicle.
- Self-referential consistency check (applying model's own stated criteria against its claims) is a high-value variant of synthesis challenge, not a separate move.
- Domain-expertise correction, deflation resistance, and sycophancy-adjacent pattern detection are subtypes of existing moves (error flagging, scope correction, sycophancy detection respectively).
Phase 0 is exploratory; Phase 1+ is confirmatory. Any taxonomy changes discovered during the experiment are documented but do not alter the experimental conditions mid-run.
- Finalized behavioral taxonomy (with real examples)
- Move frequency distribution
- Move co-occurrence patterns
- Domain-dependence classification
- Principled groupings → graduated prompt levels
- Template system prompts for each condition
Judge model: Different from both the focal LLM and the user LLM (prevents model-specific biases).
Rubric dimensions (scored 1-5 each):
| Dimension | Weight | What it measures | Anchors |
|---|---|---|---|
| Factual accuracy & calibration | ×1.0 | Are claims well-grounded and is confidence proportionate? | 1=major unsupported claims, 3=mostly supported, 5=well-grounded with calibrated uncertainty |
| Logical coherence | ×1.0 | Does the argument follow? | 1=non-sequiturs, 3=mostly coherent, 5=airtight |
| Depth/nuance | ×1.5 | Does it go beyond surface-level? | 1=superficial, 3=competent, 5=expert-level |
| Completeness | ×1.0 | Are important considerations covered? | 1=major omissions, 3=covers main points, 5=comprehensive |
| Intellectual honesty | ×1.5 | Acknowledges uncertainty, limitations, counterarguments? | 1=overconfident, 3=some hedging, 5=calibrated |
| Originality | ×0.75 | Non-obvious insights or connections? | 1=generic/boilerplate, 3=competent, 5=genuinely insightful |
Sensitivity analysis: show results hold across equal-weight and alternative weighting schemes. Pre-registered alternative: originality at ×1.5 (art study showed novel reframing is one of critical engagement's most distinctive contributions; if the effect is stronger under this weighting, it indicates where quality improvement concentrates).
Standardized closing prompt: Every conversation across all 9 conditions ends with a standardized final prompt from the user LLM that produces the evaluable artifact. This prompt is identical across all conditions:
"Now, synthesizing everything from our discussion, provide your most complete and considered response to the original question: [original opening prompt repeated verbatim]."
This ensures: (a) the final response is comparable across conditions — it addresses the same question, (b) it captures whatever quality improvement the conversation produced — the model synthesizes everything it has learned/revised, (c) it is a self-contained artifact scorable without conversation history. Without this standardization, passive final turns ("anything else?") and critical final turns (response to a specific pushback) are incommensurable objects.
Scoring procedure:
- Judge sees ONLY the standardized final response, not conversation history
- The focal LLM system prompt instructs the model to produce self-contained responses (not referencing conversation context like "as you mentioned earlier"), reinforcing comparability
- Judge scores against rubric without knowing which condition produced it
- Each output scored 3 times (stability check)
- Validate on 10% subset against personal ratings
| Task type | Scoring method |
|---|---|
| Constraint satisfaction | Binary per constraint + total count |
| Multi-source synthesis | Checklist: contradictions identified, methodological differences noted, conclusions scoped |
| Specification-sensitive | Checklist: ambiguities flagged, interpretations addressed |
| Fermi estimation | Within 1 order of magnitude + key factors identified |
- Final output: Last substantive model response
- Trajectory: Quality at turn 1, 5, 10, and final turn (captures both "preventing degradation" and "unlocking quality")
- Per-turn improvement: Average quality delta per turn (efficiency metric)
- Degradation detection: Within passive condition, does turn-10 quality < turn-1 quality?
- Latent quality detection: Does critical turn-10 exceed ALL conditions' turn-1 quality?
Trajectory analysis requires LLM-as-judge scoring at 4 checkpoints per conversation (~3× judge API cost).
| Role | Model | Provider | Pricing ($/M tokens) | Mode | Rationale |
|---|---|---|---|---|---|
| Focal LLM | DeepSeek V4 Pro | DeepSeek | $1.74 in / $3.48 out | Real-time | Frontier open-weights; both non-think and think-high bookends |
| User LLM | Claude Sonnet 4.6 | Anthropic | $3.00 in / $15.00 out | Real-time | Frontier; strong system prompt following; no daily rate limits |
| Judge LLM | GPT-5.4 | OpenAI | $2.50 in / $15.00 out | Real-time | Strong evaluator; temperature=0 for reproducibility |
| Role | Model | Provider | Rationale |
|---|---|---|---|
| Focal LLM | Gemini 3.1 Pro | Different architecture from DeepSeek; cross-model generality | |
| User LLM | Claude Sonnet 4.6 | Anthropic | Same user as main — isolates focal model as only variable |
| Judge LLM | GPT-5.4 | OpenAI | Same judge as main — scores are comparable across experiments |
Three independent providers per experiment (DeepSeek/Anthropic/OpenAI for main, Google/Anthropic/OpenAI for replication). User and judge held constant across main and replication — only the focal changes, giving the cleanest possible cross-model comparison.
Key design choices:
- User LLM must be real-time (turn-by-turn conversation with focal). Batch not possible.
- User LLM must be frontier-tier: the user-focal interaction is a sparring contest. A weak user produces weak critiques and the dose-response curve flattens for artifactual reasons.
- User and judge from different providers than focal prevents same-family biases.
- Claude Sonnet 4.6 chosen over Gemini 3.1 Pro due to Gemini's 250 requests/day rate limit (would stretch experiment to ~20 days).
Evaluation checkpoints: 2 per conversation (turn 1 response + standardized closing response). Both answer the original question, making them directly comparable. Intermediate turns are responses to specific follow-ups and not commensurable with opening/closing.
Final-turn composite score — weighted average of 6 rubric dimensions (see §6 for weights).
- Per-dimension scores at final turn
- Quality trajectory slope (improvement rate per turn)
- Turn-1 vs. final-turn delta (within-conversation quality change)
- Per-move quality lift (which taxonomy moves predict biggest quality jumps?)
- Degradation analysis: does passive condition quality decline over turns?
- Cross-model comparison: does the effect size differ between focal LLMs?
- Output quality across conditions (ANOVA or Kruskal-Wallis depending on distributional assumptions)
- Pairwise comparisons (see key comparisons table in §4.1)
- Effect sizes (Cohen's d), not just p-values
- Dose-response: which specific taxonomy moves predict quality improvement? (requires post-hoc annotation of critical condition)
- Task-type interaction: effect for objective vs. open-ended tasks
- Turn/time analysis: quality-per-turn across conditions
- Cross-model: does the effect hold across different focal LLMs?
Instead of pairwise L1-vs-L2 and L2-vs-L3 tests (which are underpowered for medium effects at n=12 tasks), use a monotonic trend test. Mixed-effects model with task as random effect, level as ordinal fixed effect (1, 2, 3), and run as replicate. This uses all three groups simultaneously and has substantially more power than pairwise comparisons.
Pairwise comparisons are secondary. Report effect sizes (Cohen's d) with 95% CIs regardless of significance.
Paired design, α=0.05:
| Comparison | Tasks | Runs | Expected d | Power |
|---|---|---|---|---|
| L3 vs Passive | 12 (Tier 1) | 3 | >1.0 | >0.89 |
| L3 vs Effortful | 12 | 3 | 0.7-1.0 | 0.72-0.89 |
| L1-L2-L3 trend | 12 | 3 | 0.5 (per step) | ~0.65 (trend test) |
| L2 vs L1 pairwise | 27 (all) | 5 | 0.5 | ~0.70 |
The go/no-go gate (after Tier 1) tests L3 > Effortful, which is well-powered even with 12 tasks. The dose-response trend is adequately powered with all tasks. Individual pairwise comparisons between adjacent levels are exploratory, not confirmatory.
Mechanism test: After each L3 conversation, prompt the focal model independently: "Given the topic [X], what are the strongest counterarguments to the mainstream position? What alternative framings might be more productive? What are the key assumptions that should be questioned?"
If the model produces the same insights that the user LLM "pointed at" during conversation, this supports the attentional redirection interpretation (the quality was latent). If it cannot, the user LLM contributed something genuinely novel beyond pointing. This is a low-cost supplementary analysis (~1 additional API call per L3 conversation).
- L3 > Passive (expected, almost certain)
- L3 > Effortful-uncritical (the important test — rules out "just more effort")
- L3 > Best-of-N-passive (rules out "just sampling diversity")
- L3 > Adversarial (validates that targeted pushback matters)
- L3 > L2 > L1 (dose-response — more move categories = better output)
- L1 ≈ Generic critique (evaluative moves may be equivalent to unstructured "find problems")
- Effect larger for open-ended tasks than objective tasks (hypothesis)
| Phase | What | Tasks | Conditions | Runs | Est. conversations | Est. cost (DeepSeek) |
|---|---|---|---|---|---|---|
| Phase | What | Convs | Focal | User (Gemini 3.1 Pro) | Judge (GPT-5.4) | Total |
| ------- | ------ | ------- | ------- | ---------------------- | ---------------- | ------- |
| Pilot | Thinking mode + validation | ~27 | ~$1 | ~$4 | ~$1 | ~$6 |
| Phase 1 | Tier 1, think-high (all 9 conds) | ~324 | ~$37 | ~$181 | ~$16 | ~$235 |
| Phase 1b | Non-think bookend (passive + L3) | ~72 | ~$3 | ~$40 | ~$4 | ~$47 |
| Phase 2 | Extend to 5 runs | +216 | ~$11 | ~$97 | ~$11 | ~$119 |
| Phase 3 | Tier 2 tasks | ~405 | ~$20 | ~$183 | ~$20 | ~$223 |
| Phase 4 | Claude Sonnet replication | deferred | ||||
| Committed (Phase 1+1b) | ~396 | ~$23 | ~$182 | ~$20 | ~$226 |
Revised cost reality (from pilot data):
- Gemini 3.1 Pro dominates costs (~80% of total). ~$0.33/conv passive, ~$0.67/conv critical.
- DeepSeek V4 Pro is cheap. ~$0.05/conv non-think.
- Judge (GPT-5.4) is cheap. ~$0.05/conv (3× scoring).
- Original $163 estimate undercosted Gemini by ~2×.
Bookend design (from pilot Goal 1 findings):
- Non-think: incisive, hedged, compressed — naturally honest. All 9 conditions.
- Think-high: assertive, thorough, confident — overconfident in presentation. Passive + L3 only.
- Think-max ≈ think-high on analytical tasks (same token usage). No separate max level.
- Bookend tests whether critique can crack think-high's overconfidence in addition to unlocking non-think's headroom.
Decision gates:
- After Phase 1: Is L3 > effortful-uncritical with medium+ effect size? If no → stop or redesign.
- After Phase 1: Does the effect hold at both thinking mode bookends?
- After Phase 2: Are results stable with more runs? Proceed to Tier 2.
- After Phase 1 results: revisit Claude replication scope and budget.
Design: 2 tasks (A5, C1) × 3 runs × {non-think, think-high} × passive = 12 conversations.
Findings:
| Metric | Non-think | Think-high |
|---|---|---|
| Mean composite | 4.42 | 4.17 |
| Std deviation | 0.11 | 0.24 |
| Depth/nuance | 4.83 | 4.17 |
| Intellectual honesty | 4.00 | 3.50 |
Qualitative analysis: Think-high produces more assertive, thorough, encyclopedia-like responses. Non-think produces more incisive, hedged, essay-like responses. Think-high resolves ambiguity internally and presents resolved conclusions confidently; non-think preserves ambiguity in the output.
Decision: Run BOTH as bookends. Think-high for all 9 conditions (primary analysis) — ecologically valid, more headroom for dose-response gradient (intellectual honesty at 3.5 gives room for critique to push to 4-5). Non-think for passive + L3 only (robustness check — tests if the effect survives against an already-nuanced baseline where headroom is smaller). Think-max ≈ think-high on analytical tasks (same reasoning token usage), so no separate max level.
Rationale for think-high as default: The dose-response curve (L1→L2→L3) is the primary contribution and needs enough headroom to show a gradient. Non-think at 4.42 baseline may be too close to ceiling. Think-high at 4.17 with overconfident presentation gives critique more to work with. The non-think bookend then answers the stronger scientific question: does the effect hold even when the model is already naturally nuanced?
Cross-judge comparison: Deferred. Human validation subset (10% of responses rated by the researcher) is the calibration check for judge bias.
-
"Sycophancy Is Not One Thing" (ICLR 2026) — sycophantic agreement, genuine agreement, and praise are distinct, independently steerable behaviors in latent space. Critical engagement may suppress sycophancy as a separable mechanism.
-
"Another Turn, Better Output?" (2025) — turn-wise analysis of iterative prompting. 4th iteration offers negligible/negative gains. Our study extends: does turn type matter, not just count?
-
Multi-turn performance degradation — 39% avg performance drop in multi-turn vs. single-turn. GPT-4o: 14.1% conversation correctness in complex scenarios. Critical engagement may prevent degradation, not just unlock quality.
-
MAD gains are mostly majority voting (ICLR 2025) — multi-agent debate doesn't consistently outperform simpler strategies. Strengthens our case: generic disagreement ≠ targeted critical engagement.
-
Self-Refine — ~20pp gains with iterative self-feedback, but doesn't differentiate feedback quality. Our effortful-uncritical vs. critical comparison directly tests this.
-
Generation-verification asymmetry — learning to generate doesn't improve self-verification, but learning to verify DOES improve generation. Critical engagement = external verification, which is more effective than self-supply.
-
Automation bias in LLM-trained physicians (2025) — erroneous LLM recommendations degrade diagnostic performance even in AI-trained MDs. CRT study: warning nudges almost double performance vs. faulty AI.
-
GenAI reduces critical thinking (survey, 2025) — knowledge workers self-report shift from active problem-solving to passive verification.
-
Human-AI complementarity needs augmentation, not emulation (Nature Reviews Psychology, 2026) — effective teaming via complementary strengths.
- vs. MAD: Quality of critique matters (our effortful-uncritical control is the single strongest differentiator — nobody in MAD has tested "same effort, different substance"). Also: empirically grounded behavioral taxonomy, dose-response via graduated levels.
- vs. Self-Refine: We test feedback quality, not just iteration count.
- vs. automation bias: We quantify the gap AND show it's recoverable through specific engagement patterns.
- vs. "just improve the model": Generation-verification asymmetry means external critical engagement provides something the model can't self-supply.
- "Another Turn, Better Output?" — methodology and findings
- "Sycophancy Is Not One Thing" — mechanistic angle
- Nature Reviews Psychology 2026 — complementarity framing
- CRT + nudges study — methodological inspiration
If an LLM prompted to be critical can unlock better outputs from another LLM, why doesn't the focal LLM just produce that quality in the first place?
- Default outputs optimize for plausibility/agreeableness, not correctness
- Models have "knowledge" they don't surface without challenge
- Generation vs. verification asymmetry (easier to recognize good reasoning than produce it unprompted)
- Training incentives: RLHF rewards helpful-seeming responses, not maximally rigorous ones
This is a property of model architecture and training — testable and interesting independent of the human claim.
Stages (headline only — detailed planning per phase):
- Stage 0: Art Study & Taxonomy (~15h)
- Phase 0.1: Topic selection & conversation planning
- Phase 0.2: Conduct art study conversations
- Phase 0.3: Annotation & taxonomy derivation
- Phase 0.4: System prompt authoring
- Stage 1: Experiment Infrastructure (~20h)
- Stage 2: Tier 1 Experiment (~15h)
- Stage 3: Tier 2 & Replication (~20h)
- Stage 4: Writeup (~15h)
Detailed phase breakdowns live in docs/phases/.
-
Final art study topic selection→ 1b, 1c, 2a, 3a (4 reserve topics available) -
ChatGPT export analysis→ Confirmed topic selection and taxonomy coverage (12/13 moves pre-sycophancy-detection). Sycophancy detection added as 14th move based on cross-platform evidence. - Finalize task tiering (Tier 1 vs. Tier 2)
-
Graduated levels: from start (3 levels) vs. add later?→ Graduated from start (3 levels) - Objective task redesign (current proposals may be too easy for frontier models)
Research blog post (Anthropic-style) or workshop paper (CHI, CSCW, NeurIPS workshop). Rigorous but accessible to both researchers and practitioners.
- Introduction: LLMs are powerful but passive use underperforms. Critical thinking's role hasn't diminished.
- Behavioral taxonomy of critical thinking moves in LLM interaction (contribution #1)
- Experimental design: art study → grounded LLM-LLM simulation
- Results (contribution #2)
- Discussion: implications for human LLM use, agentic system design, sycophancy (contribution #3)
- Limitations: LLM-LLM as proxy (acknowledged, with "necessary condition" argument)