
I noticed that int(unicode_string) sometimes gives surprising results. For example, int('᪐᭒') == 2.

>>> bytes('᪐᭒', 'utf-8')
b'\xe1\xaa\x90\xe1\xad\x92'
>>> [f'U+{ord(c):04X}' for c in '᪐᭒']
['U+1A90', 'U+1B52']

My expectation was that it would fail, because the string does not contain a number.

Is there some explanation for this behaviour?

  • I do not understand. Are you sure the characters have no numeric value? (Unicode has a list of numbers in different scripts, together with their values.) It would also be nice if you wrote out the code points of the characters; they are easier to look up. Commented Aug 14 at 13:13
  • Isn't the byte representation exactly what you want? If I read you correctly, I may have hit a number 2 in one of the >200 languages of this planet? Does that also mean that a simple int(...) scans all possible number representations of all languages? That would be a severe performance issue, I think... Commented Aug 14 at 13:16
  • They're both in the Decimal Number class - Tai Tham Tham Digit Zero and Balinese Digit Two - so int('᪐᭒') is effectively int('02') -> 2. Commented Aug 14 at 13:18
  • "Nobody" uses the byte representation, but we can derive the code points from it. We usually use the U+xxxx form, which is also easier to look up (e.g. on Wikipedia); you can get it with [f"U+{ord(c):04X}" for c in '᪐᭒'] Commented Aug 14 at 13:19
  • Ok, I added the U representation for completeness. Commented Aug 14 at 13:32

1 Answer


...the string does not contain a number.

The two characters you show are numbers; they're both in the decimal number class, and per Python's documentation:

The values 0–9 can be represented by any Unicode decimal digit.
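That documented behaviour is easy to check with digits from other scripts (Arabic-Indic and fullwidth digits here, chosen purely as examples):

```python
# Any Unicode decimal digit is accepted, not just ASCII '0'-'9'
print(int('٤٢'))    # Arabic-Indic digits four, two -> 42
print(int('４２'))  # fullwidth digits four, two    -> 42
```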

Specifically:

  • ᪐ is U+1A90 TAI THAM THAM DIGIT ZERO, with digit value 0
  • ᭒ is U+1B52 BALINESE DIGIT TWO, with digit value 2

So int('᪐᭒') is effectively int('02'), which is indeed 2.
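You can confirm the digit values with the stdlib unicodedata module (a quick check against the Unicode database, not how int() itself is implemented):

```python
import unicodedata

s = '᪐᭒'
# Map each character to its decimal digit value per the Unicode database
digits = ''.join(str(unicodedata.decimal(c)) for c in s)
print(digits)           # '02'
print(int(s), int(digits))  # both 2
```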


Does that also mean that a simple int(...) [scans] all possible number representations of all languages?

In CPython, creating an int from a string goes through PyLong_FromUnicodeObject, which first calls _PyUnicode_TransformDecimalAndSpaceToASCII, a function which:

Converts a Unicode object holding a decimal value to an ASCII string for using in int, float and complex parsers.
Transforms code points that have decimal digit property to the corresponding ASCII digit code points.
Transforms spaces to ASCII.
Transforms code points starting from the first non-ASCII code point that is neither a decimal digit nor a space to the end into '?'.
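A rough Python sketch of that transformation (simplified: the real C function replaces everything from the first problematic non-ASCII code point to the end with '?', whereas this version substitutes per character):

```python
import unicodedata

def transform_decimal_and_space_to_ascii(s: str) -> str:
    """Simplified sketch of _PyUnicode_TransformDecimalAndSpaceToASCII."""
    out = []
    for c in s:
        if c.isspace():
            out.append(' ')                       # spaces become ASCII space
        elif (d := unicodedata.decimal(c, None)) is not None:
            out.append(str(d))                    # decimal digits become ASCII digits
        elif ord(c) < 128:
            out.append(c)                         # ASCII passes through unchanged
        else:
            out.append('?')                       # anything else poisons the parse
    return ''.join(out)

print(transform_decimal_and_space_to_ascii('᪐᭒'))  # '02'
```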

So you're getting something a little like:

>>> import unicodedata
>>> [unicodedata.digit(c, "?") for c in "᪐᭒"]
[0, 2]

prior to being parsed in the specified base. So it's not so much "scans all possible number representations" as a per-character lookup: each character of the input is checked to see whether it's considered a digit, and if it is, its Unicode properties include which digit it represents.


3 Comments

compart.com (your choice of link) doesn't provide an image of a glyph for either of those Unicode characters. codepoints.net (and others) provides an image, which lets you see what it actually looks like, without the font installed.
I see that this answer solves my question. However, I'm really shocked by the performance impact of this when hundreds of digit representations are considered instead of just 10. Of course, this is a decision made by the Python developers and not the application developers.
this is one of the things that makes Python 3.x slower than 2.x, yes. "it's slower because all strings are Unicode" is the explain-like-I'm-five version. "there are hundreds of representations of digits" explains one way in which all strings being Unicode makes Python 3.x slower. In practice, the Unicode digit codepoints are going to be stored in a data structure that makes testing for digits as performant as possible, but it's still going to be slower than 48 <= ord(c) < 58
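To illustrate the trade-off described above, here is a hypothetical ASCII-only check of the kind that comment mentions, next to the Unicode-aware behaviour:

```python
import unicodedata

def is_ascii_digit(c: str) -> bool:
    # The narrow, fast test: only ASCII '0'-'9' count
    return 48 <= ord(c) < 58

print(is_ascii_digit('2'), is_ascii_digit('᭒'))   # True False
print('᭒'.isdigit(), unicodedata.digit('᭒'))      # True 2
```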
