kar-ganap/crit-thinking

Same Question, Different Answer

Latent Quality in LLMs Under Critical Engagement

This repository contains the code, prompts, raw conversation data, judge outputs, and LaTeX source for the paper Same Question, Different Answer: Latent Quality in LLMs Under Critical Engagement (paper/main.pdf · paper/main.tex).

What the paper finds

An LLM's default single-turn output systematically underrepresents what the model can produce under sustained critical user engagement. Across 14 open-ended analytical tasks and 9 engagement conditions, critical engagement substantially outperforms best-of-$N$ sampling, and the effect decomposes into two structurally distinct modes:

  • Evaluative critique (three error-and-gap-flagging moves) drives epistemic calibration (Cohen's $d = 1.51$ vs. the sampling baseline).
  • Comprehensive critique (the full 14-move taxonomy, adding elicitation, generative, and calibration moves) drives analytical novelty (Cohen's $d = 2.54$).

Both effects persist against a passive-engagement control, replicate across focal models (DeepSeek V4 Pro → Grok 4.3), and survive multiple-comparison correction. The two modes are not points on a single dose-response axis: a controlled persistence-directive ablation (Appendix B) shows that combining them via prompt instruction degrades both. Full results are in the paper.
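
For readers who want to sanity-check the effect sizes quoted above, here is a minimal sketch of Cohen's $d$ with a pooled standard deviation. It is an illustration only: the README does not state which estimator the paper uses, so the pooled-variance form is an assumption, `cohens_d` is a hypothetical helper that is not part of this repository, and the scores are made up rather than taken from `data/experiments/`.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment: list[float], baseline: list[float]) -> float:
    """Standardized mean difference using the pooled sample variance.

    Assumption: pooled-SD form of Cohen's d; the paper may use a variant.
    """
    n1, n2 = len(treatment), len(baseline)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")
    # Weight each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Made-up rubric scores, for illustration only.
critique_scores = [5, 6, 6, 7]
sampling_scores = [4, 5, 5, 6]
print(round(cohens_d(critique_scores, sampling_scores), 2))  # 1.22
```

A value of $d$ near 1.2 means the two group means sit about 1.2 pooled standard deviations apart, which gives a sense of scale for the 1.51 and 2.54 reported above.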

Repository layout

| Path | What it is |
| --- | --- |
| paper/ | Paper LaTeX source, references, figures and figure-generation scripts (paper/figures/), and prose-number analysis scripts (paper/analysis/). paper/main.pdf is the compiled manuscript. |
| src/crit_thinking/ | Python package: experiment runner, model clients (Anthropic, OpenAI, DeepSeek, Grok, Google), conversation pipeline, judge pipeline, storage layer. |
| src/crit_thinking/prompts/ | All 9 user-LLM system prompts (one per engagement condition) plus the 2 judge prompts (per-dimension 7-point rubric, position-bias-corrected pairwise). Same files loaded at runtime and reproduced verbatim in Appendix B. |
| src/crit_thinking/tasks/ | The 16 analytical tasks (opening prompts and metadata). |
| data/experiments/ | Raw transcripts and judge outputs for all 327 conversations across the primary, replication, ablation, bookend, and pilot studies. See data/experiments/README.md for layout. |
| tests/ | Unit and integration tests for the experiment infrastructure. |
| art_study/ | The taxonomy-pilot conversations (one human researcher engaging Claude Opus 4.7 across 4 overlap topics), plus the operational codebook and the critical-thinking reference card used to scaffold the engagement. |
| docs/ | Conceptual notes, the measurement framework, and the phase-by-phase project journal (docs/phases/). |
| literature-review/ | Reviews of the three closest prior works (Self-Refine, Multi-Agent Debate, Another Turn) that gated the design. Blocks B/C were de-scoped. |
| research_design.md | The authoritative pre-registered design document. |
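
The "position-bias-corrected pairwise" judge mentioned for src/crit_thinking/prompts/ can be pictured with the sketch below. It shows one common way to correct for position bias: ask the judge twice with the order swapped and accept a winner only when both orderings agree. This is an assumption about the general technique, not a description of the repository's judge pipeline; `debiased_preference`, the `Judge` type, and the stub judges are all hypothetical names introduced for illustration.

```python
from typing import Callable

# Hypothetical judge interface: given two responses in presentation order,
# return "first" or "second" for the preferred one.
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging in both presentation orders."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    forward_pick = "a" if forward == "first" else "b"
    backward_pick = "b" if backward == "first" else "a"
    # Disagreement across orderings signals position bias, so call it a tie.
    return forward_pick if forward_pick == backward_pick else "tie"


def always_first(x: str, y: str) -> str:
    """Stub judge with pure position bias (always prefers the first slot)."""
    return "first"


def longer_wins(x: str, y: str) -> str:
    """Stub judge that is order-independent (prefers the longer response)."""
    return "first" if len(x) >= len(y) else "second"


print(debiased_preference(always_first, "short", "a longer answer"))  # tie
print(debiased_preference(longer_wins, "short", "a longer answer"))   # b
```

A real judge would wrap an LLM call; the stubs only show that a purely position-driven verdict collapses to a tie while an order-independent one survives.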

Reproducing the paper's numbers

Every figure and reported number regenerates from a committed script against committed data.

# 1. Set up the environment
uv sync                       # uses pyproject.toml + uv.lock

# 2. Reproduce figures (output goes to paper/figures/)
uv run python paper/figures/fig1_taxonomy.py
uv run python paper/figures/fig2_dimension_heatmap.py
uv run python paper/figures/fig3_effect_sizes.py
# ... etc

# 3. Reproduce prose numbers (output JSON in paper/analysis/)
uv run python paper/analysis/objective_detection.py
uv run python paper/analysis/art_study_pairwise.py
uv run python paper/analysis/v2_followthrough.py
uv run python paper/analysis/v2_conversation_length.py

# 4. Build the paper
cd paper && make pdf

The analysis and figure scripts read from data/experiments/ and do not make any API calls. To re-run the experiments themselves (which does require API credentials), copy .env.example to .env, fill in keys for Anthropic / OpenAI / DeepSeek / xAI / Google, then drive the runner from src/crit_thinking/scripts/.
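
Before launching an experiment re-run, it can help to confirm that credentials are actually visible to the process. The sketch below is a hypothetical pre-flight check, not a script from this repository. The variable names in `ASSUMED_KEYS` are guesses based on common provider conventions; `.env.example` is the authoritative list. It also assumes the values from `.env` have already been exported into the environment, since Python does not read `.env` files on its own.

```python
import os
from collections.abc import Mapping, Sequence

# Assumed variable names; check .env.example for the real ones.
ASSUMED_KEYS = (
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "XAI_API_KEY",
    "GOOGLE_API_KEY",
)


def missing_keys(
    env: Mapping[str, str], required: Sequence[str] = ASSUMED_KEYS
) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_keys(os.environ)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All assumed credentials are set.")
```

Per the requirements listed in this README, only some providers are needed for a given run, so a missing key is not necessarily an error.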

Requirements

  • Python 3.11+ · uv for environment management
  • For paper rebuild: TeX Live 2025 or equivalent (the paper uses newtx, fvextra, cleveref, tikz, and standard ML-paper packages)
  • For experiment re-runs: API access to DeepSeek (focal, primary) and optionally Grok (focal, replication), Anthropic (user-LLM), OpenAI (judge)

Citation

@misc{bhat2026samequestion,
  title  = {Same Question, Different Answer: Latent Quality in LLMs
            Under Critical Engagement},
  author = {Bhat, Kartik G},
  year   = {2026},
  url    = {https://github.com/kar-ganap/crit-thinking}
}

License

MIT — see LICENSE. This covers code, prompts, conversation transcripts, and the paper LaTeX source. Use freely with attribution.

Contact

Kartik G Bhat · gkartik@gmail.com
