Skip to content

Tighten CPython parity for str format spec, %-format, and str() constructor#7769

Merged
youknowone merged 1 commit into
RustPython:mainfrom
changjoon-park:fix-str-format-constructor-cformat
May 4, 2026
Merged

Tighten CPython parity for str format spec, %-format, and str() constructor#7769
youknowone merged 1 commit into
RustPython:mainfrom
changjoon-park:fix-str-format-constructor-cformat

Conversation

@changjoon-park

@changjoon-park changjoon-park commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Eight CPython parity gaps in str formatting and construction, grouped into three themes.

Group 1 — str() constructor decode mode

CPython's unicode_new_impl (Objects/unicodeobject.c) enters decode mode when encoding OR errors is provided, with separate early-reject paths for str input and non-bytes input. RustPython only checked encoding, and the non-bytes path surfaced a generic "a bytes-like object is required" error.

>>> str(b'foo', errors='strict')
# CPython 3.14.4: 'foo'
# RustPython main: "b'foo'"  ❌

>>> str('hello', errors='strict')
# CPython 3.14.4: TypeError: decoding str is not supported
# RustPython main: TypeError: a bytes-like object is required, not 'str'  ❌

>>> str(123, errors='strict')
# CPython 3.14.4: TypeError: decoding to str: need a bytes-like object, int found
# RustPython main: TypeError: a bytes-like object is required, not 'int'  ❌

Group 2 — str.format() spec engine

CPython's format spec engine (Python/formatter_unicode.c) truncates to precision before applying width pad. RustPython applied precision after, so the pad got truncated off. Separately, '{}'.format() with insufficient args wrapped Rust's tuple-index error rather than emitting CPython's domain-specific message.

>>> '{0:3.2s}'.format('abc')
# CPython 3.14.4: 'ab '
# RustPython main: 'ab'  ❌

>>> '{}'.format()
# CPython 3.14.4: IndexError: Replacement index 0 out of range for positional args tuple
# RustPython main: IndexError: tuple index out of range  ❌

The IndexError wording change covers both auto-numbered and explicit-index paths.

Group 3 — %-format integer dispatch and wording

CPython's %-format path (Objects/unicodeobject.c::formatlong and adjacent) dispatches integer conversions through PyNumber_Index (the __index__ protocol), and uses "a real number" / "an int or a unicode character" wording. RustPython only tried __int__ for %d / %i / %u, never tried __index__ for the integer-conversion error path on %x / %o / %X / %c, and used divergent wording.

>>> class P:
...     def __index__(self): return 42

>>> '%x' % P()
# CPython 3.14.4: '2a'
# RustPython main: TypeError: %x format: an integer is required, not P  ❌

>>> '%d' % P()
# CPython 3.14.4: '42'
# RustPython main: TypeError: %d format: a number is required, not P  ❌

>>> '%c' % 'foo'
# CPython 3.14.4: TypeError: %c requires an int or a unicode character, not str
# RustPython main: TypeError: %c requires int or char  ❌

%d / %i / %u now prefer __index__ over __int__ (matching CPython), and the "a number" wording becomes "a real number" for the complex/float-as-decimal mismatch. Applied to both bytes and str %-format paths.

Verification (CPython 3.14.4)

Probe set Cases Match
str format spec width × precision × content 252 252/252
%-format integer dispatch ({fmt} × {int, bool, P}) 42 42/42
str() constructor combinations + edge cases 29 29/29
'{}'.format() IndexError variations 6 6/6
Non-ASCII format (CharLen for String) 4 4/4
% format with %s / %r / %a 33 33/33
% format with floats 49 49/49
% format with mappings 4 4/4
str.format() keyword + attr access 5 5/5
format() builtin 51 51/51
Total 475 475/475

cargo run --release -- -m test test_str test_bytes test_format test_codecs test_io test_unicode_identifiers — 1,429 tests pass, 0 regressions. All 188 extra_tests/snippets/*.py pass under the CI feature set. cargo test rustpython-common cformat — 11 unit tests pass (including non-ASCII).

Tests unmasked

  • Lib/test/test_str.py::test_constructor_keyword_args
  • Lib/test/test_str.py::test_constructor_defaults
  • Lib/test/test_descr.py::ClassPropertiesAndMethods.test_keywords — basic-type constructor parity aggregator; str(object=500) and str(object=b'abc', errors='strict') paths now pass.

Out of scope (markers retained, with updated reasons)

  • test_format'{0:08s}'.format('result') treats 0 as fill+left-align in CPython but as numeric zero-pad in RustPython's spec parser. Separate format-spec parser concern.
  • test_formatting%c error message expects fully-qualified module.qualname; RustPython returns the bare class name. Broader tp_name qualification concern.

@coderabbitai

coderabbitai Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR makes three orthogonal changes: string formatting now truncates to precision before alignment; the str() constructor’s fast-path and decode semantics are aligned with CPython (encoding/errors rules and errors behavior); and format specifiers plus positional field errors were extended to accept __index__ and produce clearer index-out-of-range messages.

Changes

String Formatting Precision Handling

Layer / File(s) Summary
Precision Logic
crates/common/src/format.rs
FormatSpec::format_string now truncates the input string to precision characters before calling format_sign_and_align (previously truncation happened after alignment).
Utility Implementation
crates/common/src/format.rs
Added impl CharLen for String that returns self.chars().count() so alignment computes character length correctly.

String Constructor Behavior

Layer / File(s) Summary
Fast-Path Condition
crates/vm/src/builtins/str.rs
PyStr::slot_new fast-path (returning the exact str object) now requires both encoding and errors to be missing rather than only encoding.
Decode Logic / Allocation
crates/vm/src/builtins/str.rs
py_new now enters decode mode when either encoding or errors is present; rejects str inputs in decode mode, rejects non-bytes-like inputs with a bytes-like TypeError, and defaults encoding to "utf-8" when only errors is provided. If neither is present, falls back to input.str(vm) as before.

Format Conversion CPython Parity

Layer / File(s) Summary
Integer Conversion (%d/%i/%u) — bytes
crates/vm/src/cformat.rs
spec_format_bytes accepts objects implementing __index__ via try_index_opt in addition to PyInt, matching CPython behavior.
Integer Conversion (%x/%o/%X) — bytes
crates/vm/src/cformat.rs
spec_format_bytes extended to accept __index__ objects for non-decimal integer formats.
Character %c — bytes
crates/vm/src/cformat.rs
%c accepts PyInt or any __index__ integer (range-checked 0–255) and improves error messages when types mismatch.
Integer Conversion (%d/%i/%u) — strings
crates/vm/src/cformat.rs
spec_format_string accepts __index__ objects in addition to PyInt for decimal conversions.
Integer Conversion (%x/%o/%X) — strings
crates/vm/src/cformat.rs
spec_format_string accepts __index__ objects for hexadecimal/octal conversions.
Character %c — strings
crates/vm/src/cformat.rs
%c now prefers a single-character str first; otherwise accepts PyInt or __index__ integers convertible to Unicode code points (0x0..=0x10FFFF); error messages updated.
Field Resolution Errors
crates/vm/src/format.rs
format() now raises index-specific new_index_error messages for FieldType::Auto and FieldType::Index when positional arguments are out of range, e.g., "Replacement index {idx} out of range for positional args tuple".

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

🐰
I nibbled bytes, I trimmed the line,
Precision first, then pad just fine.
Decode when errors or encodings show,
Index-aware formats now steal the show.
Hoppity parity—rabbit says, "Bravo!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly summarizes the main changes: CPython parity improvements for str formatting (format spec, %-format) and str() constructor behavior.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 10)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

[ ] test: cpython/Lib/test/test_descr.py (TODO: 36)
[ ] test: cpython/Lib/test/test_descrtut.py (TODO: 3)

dependencies:

dependent tests: (no tests depend on descr)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

…ructor

Five related CPython parity gaps in `str` formatting and construction:

1. **`str(bytes, errors=...)` triggers decode mode.** Previously, only
   `encoding=` triggered decode; passing only `errors=` fell back to
   `repr()`. CPython's behavior: presence of `encoding` OR `errors`
   triggers decode mode (default UTF-8 when only `errors` is given).

2. **`'{...}'.format() IndexError wording.** Generic Rust "tuple index
   out of range" replaced with CPython's "Replacement index N out of
   range for positional args tuple".

3. **`{0:3.2s}.format('abc')` → 'ab '.** String format spec applied
   precision after width padding; CPython truncates BEFORE padding.
   Reorder the operations.

4. **`%x` / `%o` / `%X` / `%c` accept `__index__` objects.** Previously
   only `PyInt` downcast was attempted. Mirror CPython's
   PyNumber_Index dispatch via `try_index_opt`.

5. **`%d` / `%u` / `%i` error wording.** "a number is required" →
   "a real number is required" (matches CPython).

Also adds `not <type>` suffix to `%c` error messages so the type is
visible in TypeError text (matches CPython structure even without
fully-qualified names).

Verified byte-identical with CPython 3.14.4 across 25+ probes covering
the format/spec/constructor combinations. Unmasks
`test_str.test_constructor_keyword_args` and
`test_str.test_constructor_defaults`. test_str/test_bytes/test_format/
test_codecs/test_io/test_unicode_identifiers — 1,429 tests pass, 0
regressions. All 188 `extra_tests/snippets/*.py` pass under the CI
feature set.

`test_str.test_format` and `test_str.test_formatting` markers retained:
`test_format` still trips on `'{0:08s}'.format('result')` (numeric
zero-pad treated as fill+left-align by CPython for str type — separate
format-spec parser concern). `test_formatting` still trips on
`%c` error message expecting fully qualified `module.qualname` (RP
returns bare class name — separate broader concern).
@changjoon-park changjoon-park force-pushed the fix-str-format-constructor-cformat branch from 62a1279 to e4c9aaf Compare May 3, 2026 17:58

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/vm/src/cformat.rs (1)

74-85: ⚠️ Potential issue | 🟠 Major

Surface __int__ protocol failures instead of the generic error.

When __int__ returns a non-int value, Lines 74-80 and 192-198 fall through to the generic %d format: a real number is required error. CPython 3.14 instead raises TypeError: int returned non-int (type <T>) when __int__ violates the protocol. Raise the specific protocol error to match CPython's behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/cformat.rs` around lines 74 - 85, When calling the __int__
protocol (via vm.get_method(..., identifier!(vm, __int__)) and
method?.call(...)) detect when the returned value is not a PyInt and raise a
protocol-specific TypeError instead of falling through to the generic "%d
format" error; in the block around vm.get_method / identifier!(vm, __int__) (and
the analogous block at lines ~192-198) return vm.new_type_error with a message
like "int returned non-int (type <T>)" using the returned object's class name
rather than proceeding to check_int_to_str_digits or spec.format_number so the
violation is surfaced per CPython 3.14 behavior.
🧹 Nitpick comments (1)
crates/vm/src/cformat.rs (1)

67-97: 🏗️ Heavy lift

Consider extracting the new coercion paths into shared helpers.

The __index__ / __int__ dispatch and %c coercion logic is now duplicated in both spec_format_bytes() and spec_format_string(). These are parity-sensitive branches, so keeping one shared implementation would reduce the chance of the two paths drifting again on the next wording or precedence tweak.

As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code".

Also applies to: 120-145, 185-215, 229-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/cformat.rs` around lines 67 - 97, Extract the duplicated
`__index__`/`__int__` coercion and `%c` handling into shared helper(s) and call
them from spec_format_bytes() and spec_format_string(): create one helper (e.g.,
coerce_to_index_or_int(...)) that encapsulates the try_index_opt(vm) branch, the
vm.get_method(... identifier!(vm, __int__)) call, downcast to PyInt,
check_int_to_str_digits, and returning the BigInt (or an error), and another
small helper for the `%c`-specific validation if needed; then replace the
duplicated blocks in spec_format_bytes() and spec_format_string() with calls to
these helpers and use the returned BigInt with spec.format_number(...) (or the
char logic) so both functions share the same parity-sensitive dispatch logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/vm/src/cformat.rs`:
- Around line 74-85: When calling the __int__ protocol (via vm.get_method(...,
identifier!(vm, __int__)) and method?.call(...)) detect when the returned value
is not a PyInt and raise a protocol-specific TypeError instead of falling
through to the generic "%d format" error; in the block around vm.get_method /
identifier!(vm, __int__) (and the analogous block at lines ~192-198) return
vm.new_type_error with a message like "int returned non-int (type <T>)" using
the returned object's class name rather than proceeding to
check_int_to_str_digits or spec.format_number so the violation is surfaced per
CPython 3.14 behavior.

---

Nitpick comments:
In `@crates/vm/src/cformat.rs`:
- Around line 67-97: Extract the duplicated `__index__`/`__int__` coercion and
`%c` handling into shared helper(s) and call them from spec_format_bytes() and
spec_format_string(): create one helper (e.g., coerce_to_index_or_int(...)) that
encapsulates the try_index_opt(vm) branch, the vm.get_method(... identifier!(vm,
__int__)) call, downcast to PyInt, check_int_to_str_digits, and returning the
BigInt (or an error), and another small helper for the `%c`-specific validation
if needed; then replace the duplicated blocks in spec_format_bytes() and
spec_format_string() with calls to these helpers and use the returned BigInt
with spec.format_number(...) (or the char logic) so both functions share the
same parity-sensitive dispatch logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 28b0e721-5781-41f4-b9f1-77fca72a50b0

📥 Commits

Reviewing files that changed from the base of the PR and between 62a1279 and e4c9aaf.

⛔ Files ignored due to path filters (2)
  • Lib/test/test_descr.py is excluded by !Lib/**
  • Lib/test/test_str.py is excluded by !Lib/**
📒 Files selected for processing (4)
  • crates/common/src/format.rs
  • crates/vm/src/builtins/str.rs
  • crates/vm/src/cformat.rs
  • crates/vm/src/format.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/vm/src/format.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/vm/src/builtins/str.rs

@youknowone youknowone merged commit 83002d7 into RustPython:main May 4, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants