
Conversation

@artefactop

Summary

This PR fixes handling of the reasoning field in OpenAI-compatible models like gpt-oss-20b.

Problem

Previously, when a model returned both reasoning and content fields, only the reasoning was extracted as output, losing the actual content/answer.

Solution

  • Extract both reasoning and content fields separately
  • Format output as Thinking: {reasoning}\n\n{content} when showThinking is enabled (default)
  • Return only content when showThinking: false
  • Add gpt-oss prefix to the list of models that support the reasoning_effort parameter
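The formatting rule above can be sketched as follows. This is a minimal illustration, not the actual chat.ts code; the `ChatMessage` shape and `formatOutput` name are hypothetical stand-ins:

```typescript
// Hypothetical sketch of the output-formatting rule described above.
// `ChatMessage` and `formatOutput` are illustrative names, not promptfoo APIs.
interface ChatMessage {
  content?: string | null;
  reasoning?: string;
}

function formatOutput(message: ChatMessage, showThinking = true): string {
  // Guard against non-string content so we never emit "Thinking: ...\n\nnull".
  const content = typeof message.content === "string" ? message.content : "";
  if (message.reasoning && showThinking) {
    return `Thinking: ${message.reasoning}\n\n${content}`;
  }
  return content;
}
```

With `showThinking` defaulting to true, both fields survive into the output; setting it to false drops the reasoning entirely.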

Changes

  • Modified src/providers/openai/chat.ts to properly handle reasoning field
  • Added comprehensive test coverage for both scenarios
  • Updated CHANGELOG.md

Testing

Added two test cases:

  1. Verify reasoning + content are both extracted and formatted correctly
  2. Verify reasoning is hidden when showThinking: false

References

@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

Walkthrough

This change fixes how OpenAI-compatible models (such as gpt-oss-20b) handle the reasoning field in chat responses. Previously, reasoning content was extracted but not properly combined with message content. The update introduces a separate reasoning variable during chat reply processing, prepends "Thinking: {reasoning}" to the output when reasoning exists (controlled by the showThinking flag), and extends reasoning pathway support to models starting with gpt-oss alongside existing o1/o3/o4/gpt-5 models. Output construction now prioritizes message content with reasoning-derived thinking text prepended when applicable.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Logic flow changes in core provider: The chat.ts file modifies output construction with conditional reasoning handling and thinking text injection
  • New model type coverage: Extension of reasoning pathway to gpt-oss models alongside existing reasoning-capable model types requires verification
  • Multiple test paths: Tests validate both showThinking: true and showThinking: false scenarios, as well as specific model behavior
  • State handling complexity: Tracking separate reasoning variable and integrating it into the existing tool callback and structured output parsing flows introduces conditional branching to verify

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed. The PR title "fix: handle reasoning and content fields for OpenAI-compatible models like gpt-oss-20b" directly and specifically describes the main change in the pull request. The changeset focuses on fixing how reasoning and content fields are extracted and formatted for OpenAI-compatible models (particularly gpt-oss-20b), which is exactly what the title conveys. The title is concise, clear, and avoids vague terminology, making it immediately understandable to someone scanning the project history.
  • Description Check — ✅ Passed. The PR description is well-structured and clearly related to the changeset. It explains the problem (reasoning field extraction was losing content), the solution (extracting both fields separately and formatting based on showThinking), the specific files modified (openai/chat.ts, the test file, CHANGELOG.md), and the test cases added. The description aligns with the raw summary and provides sufficient context about the implementation details and the motivation for the fix.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/providers/openai/chat.ts (1)

419-454: Add type check for string output before concatenation.

The reasoning handling at line 453 concatenates strings without verifying that output is a string, unlike the reasoning_content handling at line 569, which includes a typeof output === 'string' check. If message.content is null, undefined, or an object, the concatenation could produce unexpected results like "Thinking: ...\n\nnull" or "Thinking: ...\n\n[object Object]".

Apply this diff to add consistent type checking:

-      // Handle reasoning as thinking content if present and showThinking is enabled
-      if (reasoning && (this.config.showThinking ?? true)) {
-        output = `Thinking: ${reasoning}\n\n${output}`;
-      }
+      // Handle reasoning as thinking content if present and showThinking is enabled
+      if (
+        reasoning &&
+        typeof output === 'string' &&
+        (this.config.showThinking ?? true)
+      ) {
+        output = `Thinking: ${reasoning}\n\n${output}`;
+      }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c033ed8 and 2a6a6ab.

📒 Files selected for processing (3)
  • CHANGELOG.md (1 hunks)
  • src/providers/openai/chat.ts (3 hunks)
  • test/providers/openai/chat.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (11)
CHANGELOG.md

📄 CodeRabbit inference engine (.cursor/rules/changelog.mdc)

CHANGELOG.md: Document all user-facing changes in CHANGELOG.md
Every pull request must add or update an entry in CHANGELOG.md under the [Unreleased] section
Follow Keep a Changelog structure under [Unreleased] with sections: Added, Changed, Fixed, Dependencies, Documentation, Tests, Removed
Each changelog entry must include the PR number formatted as (#1234) or temporary placeholder (#XXXX)
Each changelog entry must use a Conventional Commit prefix: feat:, fix:, chore:, docs:, test:, or refactor:
Each changelog entry must be concise and on a single line
Each changelog entry must be user-focused, describing what changed and why it matters to users
Each changelog entry must include a scope in parentheses, e.g., feat(providers): or fix(evaluator):
Use common scopes for consistency: providers, evaluator, webui or app, cli, redteam, core, assertions, config, database
Place all dependency updates under the Dependencies category
Place all test changes under the Tests category
Use categories consistently: Added for new features, Changed for modifications/refactors/CI, Fixed for bug fixes, Removed for removed features
After a PR number is assigned, replace (#XXXX) placeholders with the actual PR number
Be specific, use active voice, include context, and avoid repeating the PR title in changelog entries
Group related changes with multiple bullets in the same category when needed; use one entry per logical change

CHANGELOG.md: All user-facing changes require a CHANGELOG.md entry before creating a PR
Add entries under [Unreleased] in appropriate category (Added, Changed, Fixed, Dependencies, Documentation, Tests)
Each changelog entry must include PR number (#1234) or placeholder (#XXXX)
Use conventional commit prefixes in changelog entries (feat:, fix:, chore:, docs:, test:, refactor:)

CHANGELOG.md: Document all user-facing changes in CHANGELOG.md
Changelog entries must include the PR number in format (#1234)
Use conventional commit prefixes in changelog entries: feat:,...

Files:

  • CHANGELOG.md
test/**/*.{test,spec}.ts

📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)

test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive describe and it blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use afterEach(() => { jest.resetAllMocks(); }))
Run tests with --randomize flag to ensure your mocks setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both --coverage and --randomize flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test

Files:

  • test/providers/openai/chat.test.ts
test/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests

Files:

  • test/providers/openai/chat.test.ts
test/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)

Place tests under test/

Files:

  • test/providers/openai/chat.test.ts
test/providers/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

Provider tests must cover: success case, error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking

Files:

  • test/providers/openai/chat.test.ts
**/*.{test,spec}.{js,jsx,ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Avoid disabling or skipping tests (e.g., .skip, .only) unless absolutely necessary and documented

Files:

  • test/providers/openai/chat.test.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Prefer not to introduce new TypeScript types; reuse existing interfaces where possible

**/*.{ts,tsx}: Maintain consistent import order (Biome handles sorting)
Use consistent curly braces for all control statements
Prefer const over let and avoid var
Use object shorthand syntax when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks

**/*.{ts,tsx}: Use TypeScript with strict type checking enabled
Follow consistent import order (Biome will sort imports)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code instead of raw promises/callbacks
When logging, pass sensitive data via the logger context object so it is auto-sanitized; avoid interpolating secrets into message strings
Manually sanitize sensitive objects with sanitizeObject before storing or emitting outside logging contexts

Files:

  • test/providers/openai/chat.test.ts
  • src/providers/openai/chat.ts
test/**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

test/**/*.{test,spec}.{ts,tsx}: Follow Jest best practices using describe/it blocks
Test both success and error cases for all functionality

Files:

  • test/providers/openai/chat.test.ts
test/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow Jest best practices with describe/it blocks in tests

Files:

  • test/providers/openai/chat.test.ts
src/providers/**/*.ts

📄 CodeRabbit inference engine (src/providers/CLAUDE.md)

src/providers/**/*.ts: Each provider must implement the ApiProvider interface (see src/types/providers.ts)
Providers must transform promptfoo prompts into the provider-specific API request format
Providers must return a normalized ProviderResponse for evaluation
Providers must handle authentication, rate limits, retries, and streaming
Always sanitize logs to prevent leaking API keys; never stringify configs directly in logs
For OpenAI-compatible providers, extend OpenAiChatCompletionProvider and configure apiBaseUrl and options via super(...)
Follow configuration priority: (1) explicit config options, (2) environment variables (PROVIDER_API_KEY), (3) provider defaults

Files:

  • src/providers/openai/chat.ts
src/**

📄 CodeRabbit inference engine (AGENTS.md)

Place core application/library logic under src/

Files:

  • src/providers/openai/chat.ts
🧠 Learnings (7)
📚 Learning: 2025-10-05T16:58:47.598Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/providers/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:58:47.598Z
Learning: Applies to src/providers/**/*.ts : For OpenAI-compatible providers, extend OpenAiChatCompletionProvider and configure apiBaseUrl and options via super(...)

Applied to files:

  • test/providers/openai/chat.test.ts
📚 Learning: 2025-10-05T17:00:56.424Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: test/CLAUDE.md:0-0
Timestamp: 2025-10-05T17:00:56.424Z
Learning: Applies to test/providers/**/*.test.ts : Provider tests must cover: success case, error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking

Applied to files:

  • test/providers/openai/chat.test.ts
📚 Learning: 2025-10-05T16:58:47.598Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/providers/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:58:47.598Z
Learning: Applies to src/providers/test/providers/**/*.test.{ts,tsx,js,jsx} : Provider tests must cover rate limits, timeouts, and invalid configuration scenarios

Applied to files:

  • test/providers/openai/chat.test.ts
📚 Learning: 2025-10-05T16:58:47.598Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/providers/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:58:47.598Z
Learning: Applies to src/providers/test/providers/**/*.test.{ts,tsx,js,jsx} : Provider tests must cover both success and error cases

Applied to files:

  • test/providers/openai/chat.test.ts
📚 Learning: 2025-07-18T17:26:23.087Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-07-18T17:26:23.087Z
Learning: Applies to test/**/*.{test,spec}.ts : Ensure provider caching behaves as expected

Applied to files:

  • test/providers/openai/chat.test.ts
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : For OpenAI, prefer models like 'openai:o3-mini' and 'openai:gpt-4o-mini' in configuration files

Applied to files:

  • src/providers/openai/chat.ts
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For OpenAI, prefer models like 'openai:o3-mini' and 'openai:gpt-4o-mini' in configuration files

Applied to files:

  • src/providers/openai/chat.ts
🧬 Code graph analysis (1)
test/providers/openai/chat.test.ts (1)
src/providers/openai/chat.ts (1)
  • OpenAiChatCompletionProvider (25-625)
🔇 Additional comments (4)
src/providers/openai/chat.ts (2)

286-287: LGTM! Support for gpt-oss models added correctly.

The addition of gpt-oss prefix to the reasoning_effort parameter handling appropriately extends support to OpenAI-compatible models like gpt-oss-20b.


456-563: Based on my verification of OpenAI's documentation and reasoning model best practices, the behavior in your code is intentional and correct. Reasoning tokens in reasoning models are hidden tokens not returned as part of the message response content but used internally to help generate answers. For safety, reasoning tokens are exposed to users only in summarized form.

When function/tool calls are executed, responses are extracted as separate output types: reasoning, function_calls, and messages. The test expectations you have (returning tool calls without reasoning) align with this API design—tool results are the primary user-facing output.

test/providers/openai/chat.test.ts (2)

355-384: LGTM! Comprehensive test for reasoning extraction.

This test correctly validates that both reasoning and content are extracted from the response and formatted with the "Thinking:" prefix when showThinking is enabled (default behavior). The test uses the new gpt-oss-20b model and verifies the complete output format.


386-415: LGTM! Test correctly validates showThinking configuration.

This test properly verifies that when showThinking: false is configured, only the content is returned without the thinking prefix, ensuring users can control whether reasoning is displayed in the output.

@mldangelo
Member

mldangelo commented Oct 31, 2025

Hi, thanks so much @artefactop! Can you please make this PR editable? I'll get it merged! Also reach out if you'd like some promptfoo swag!

@artefactop
Author

artefactop commented Nov 3, 2025

> Hi, thanks so much @artefactop! Can you please make this PR editable? I'll get it merged! Also reach out if you'd like some promptfoo swag!

Hi @mldangelo, it should be editable now.

gh api repos/promptfoo/promptfoo/pulls/6062 --jq '.maintainer_can_modify'
true

Thanks!
