Tighten CPython parity for str format spec, %-format, and str() constructor#7769
Conversation
📝 WalkthroughWalkthroughThis PR makes three orthogonal changes: string formatting now truncates to precision before alignment; the ChangesString Formatting Precision Handling
String Constructor Behavior
Format Conversion CPython Parity
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] test: cpython/Lib/test/test_str.py (TODO: 10) dependencies: dependent tests: (no tests depend on str) [ ] test: cpython/Lib/test/test_descr.py (TODO: 36) dependencies: dependent tests: (no tests depend on descr) Legend:
|
…ructor
Five related CPython parity gaps in `str` formatting and construction:
1. **`str(bytes, errors=...)` triggers decode mode.** Previously, only
`encoding=` triggered decode; passing only `errors=` fell back to
`repr()`. CPython's behavior: presence of `encoding` OR `errors`
triggers decode mode (default UTF-8 when only `errors` is given).
2. **`'{...}'.format() IndexError wording.** Generic Rust "tuple index
out of range" replaced with CPython's "Replacement index N out of
range for positional args tuple".
3. **`{0:3.2s}.format('abc')` → 'ab '.** String format spec applied
precision after width padding; CPython truncates BEFORE padding.
Reorder the operations.
4. **`%x` / `%o` / `%X` / `%c` accept `__index__` objects.** Previously
only `PyInt` downcast was attempted. Mirror CPython's
PyNumber_Index dispatch via `try_index_opt`.
5. **`%d` / `%u` / `%i` error wording.** "a number is required" →
"a real number is required" (matches CPython).
Also adds `not <type>` suffix to `%c` error messages so the type is
visible in TypeError text (matches CPython structure even without
fully-qualified names).
Verified byte-identical with CPython 3.14.4 across 25+ probes covering
the format/spec/constructor combinations. Unmasks
`test_str.test_constructor_keyword_args` and
`test_str.test_constructor_defaults`. test_str/test_bytes/test_format/
test_codecs/test_io/test_unicode_identifiers — 1,429 tests pass, 0
regressions. All 188 `extra_tests/snippets/*.py` pass under the CI
feature set.
`test_str.test_format` and `test_str.test_formatting` markers retained:
`test_format` still trips on `'{0:08s}'.format('result')` (numeric
zero-pad treated as fill+left-align by CPython for str type — separate
format-spec parser concern). `test_formatting` still trips on
`%c` error message expecting fully qualified `module.qualname` (RP
returns bare class name — separate broader concern).
62a1279 to
e4c9aaf
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/vm/src/cformat.rs (1)
74-85:⚠️ Potential issue | 🟠 MajorSurface
__int__protocol failures instead of the generic error.When
__int__returns a non-int value, Lines 74-80 and 192-198 fall through to the generic%d format: a real number is requirederror. CPython 3.14 instead raisesTypeError: int returned non-int (type <T>)when__int__violates the protocol. Raise the specific protocol error to match CPython's behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/cformat.rs` around lines 74 - 85, When calling the __int__ protocol (via vm.get_method(..., identifier!(vm, __int__)) and method?.call(...)) detect when the returned value is not a PyInt and raise a protocol-specific TypeError instead of falling through to the generic "%d format" error; in the block around vm.get_method / identifier!(vm, __int__) (and the analogous block at lines ~192-198) return vm.new_type_error with a message like "int returned non-int (type <T>)" using the returned object's class name rather than proceeding to check_int_to_str_digits or spec.format_number so the violation is surfaced per CPython 3.14 behavior.
🧹 Nitpick comments (1)
crates/vm/src/cformat.rs (1)
67-97: 🏗️ Heavy liftConsider extracting the new coercion paths into shared helpers.
The
__index__/__int__dispatch and%ccoercion logic is now duplicated in bothspec_format_bytes()andspec_format_string(). These are parity-sensitive branches, so keeping one shared implementation would reduce the chance of the two paths drifting again on the next wording or precedence tweak.As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code".
Also applies to: 120-145, 185-215, 229-251
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/cformat.rs` around lines 67 - 97, Extract the duplicated `__index__`/`__int__` coercion and `%c` handling into shared helper(s) and call them from spec_format_bytes() and spec_format_string(): create one helper (e.g., coerce_to_index_or_int(...)) that encapsulates the try_index_opt(vm) branch, the vm.get_method(... identifier!(vm, __int__)) call, downcast to PyInt, check_int_to_str_digits, and returning the BigInt (or an error), and another small helper for the `%c`-specific validation if needed; then replace the duplicated blocks in spec_format_bytes() and spec_format_string() with calls to these helpers and use the returned BigInt with spec.format_number(...) (or the char logic) so both functions share the same parity-sensitive dispatch logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/vm/src/cformat.rs`:
- Around line 74-85: When calling the __int__ protocol (via vm.get_method(...,
identifier!(vm, __int__)) and method?.call(...)) detect when the returned value
is not a PyInt and raise a protocol-specific TypeError instead of falling
through to the generic "%d format" error; in the block around vm.get_method /
identifier!(vm, __int__) (and the analogous block at lines ~192-198) return
vm.new_type_error with a message like "int returned non-int (type <T>)" using
the returned object's class name rather than proceeding to
check_int_to_str_digits or spec.format_number so the violation is surfaced per
CPython 3.14 behavior.
---
Nitpick comments:
In `@crates/vm/src/cformat.rs`:
- Around line 67-97: Extract the duplicated `__index__`/`__int__` coercion and
`%c` handling into shared helper(s) and call them from spec_format_bytes() and
spec_format_string(): create one helper (e.g., coerce_to_index_or_int(...)) that
encapsulates the try_index_opt(vm) branch, the vm.get_method(... identifier!(vm,
__int__)) call, downcast to PyInt, check_int_to_str_digits, and returning the
BigInt (or an error), and another small helper for the `%c`-specific validation
if needed; then replace the duplicated blocks in spec_format_bytes() and
spec_format_string() with calls to these helpers and use the returned BigInt
with spec.format_number(...) (or the char logic) so both functions share the
same parity-sensitive dispatch logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 28b0e721-5781-41f4-b9f1-77fca72a50b0
⛔ Files ignored due to path filters (2)
Lib/test/test_descr.pyis excluded by!Lib/**Lib/test/test_str.pyis excluded by!Lib/**
📒 Files selected for processing (4)
crates/common/src/format.rscrates/vm/src/builtins/str.rscrates/vm/src/cformat.rscrates/vm/src/format.rs
✅ Files skipped from review due to trivial changes (1)
- crates/vm/src/format.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/vm/src/builtins/str.rs
Summary
Eight CPython parity gaps in
strformatting and construction, grouped into three themes.Group 1 —
str()constructor decode modeCPython's
unicode_new_impl(Objects/unicodeobject.c) enters decode mode whenencodingORerrorsis provided, with separate early-reject paths for str input and non-bytes input. RustPython only checkedencoding, and the non-bytes path surfaced a generic "a bytes-like object is required" error.Group 2 —
str.format()spec engineCPython's format spec engine (
Python/formatter_unicode.c) truncates to precision before applying width pad. RustPython applied precision after, so the pad got truncated off. Separately,'{}'.format()with insufficient args wrapped Rust's tuple-index error rather than emitting CPython's domain-specific message.The IndexError wording change covers both auto-numbered and explicit-index paths.
Group 3 —
%-format integer dispatch and wordingCPython's
%-format path (Objects/unicodeobject.c::formatlongand adjacent) dispatches integer conversions throughPyNumber_Index(the__index__protocol), and uses "a real number" / "an int or a unicode character" wording. RustPython only tried__int__for%d/%i/%u, never tried__index__for the integer-conversion error path on%x/%o/%X/%c, and used divergent wording.%d/%i/%unow prefer__index__over__int__(matching CPython), and the "a number" wording becomes "a real number" for the complex/float-as-decimal mismatch. Applied to both bytes and str%-format paths.Verification (CPython 3.14.4)
'{}'.format()IndexError variationsCharLen for String)str.format()keyword + attr accessformat()builtincargo run --release -- -m test test_str test_bytes test_format test_codecs test_io test_unicode_identifiers— 1,429 tests pass, 0 regressions. All 188extra_tests/snippets/*.pypass under the CI feature set.cargo test rustpython-common cformat— 11 unit tests pass (including non-ASCII).Tests unmasked
Lib/test/test_str.py::test_constructor_keyword_argsLib/test/test_str.py::test_constructor_defaultsLib/test/test_descr.py::ClassPropertiesAndMethods.test_keywords— basic-type constructor parity aggregator;str(object=500)andstr(object=b'abc', errors='strict')paths now pass.Out of scope (markers retained, with updated reasons)
test_format—'{0:08s}'.format('result')treats0as fill+left-align in CPython but as numeric zero-pad in RustPython's spec parser. Separate format-spec parser concern.test_formatting—%cerror message expects fully-qualifiedmodule.qualname; RustPython returns the bare class name. Broadertp_namequalification concern.