
I noticed that int(unicode_string) sometimes gives surprising results. For example, int('᪐᭒') == 2.

>>> bytes('᪐᭒', 'utf-8')
b'\xe1\xaa\x90\xe1\xad\x92'
>>> [f'U+{ord(c):04X}' for c in '᪐᭒']
['U+1A90', 'U+1B52']

My expectation was that it would fail, because the string does not contain a number.

Is there some explanation for this behaviour?

  • I do not understand. Are you sure the characters have no numeric value? (Unicode has a list of numbers in different scripts, together with their values.) It would also be nice if you wrote out the code points of the characters; they are easier to look up. Commented Aug 14 at 13:13
  • Isn't the byte representation exactly what you want? If I read you correctly, I may have hit a number 2 in one of the >200 languages of this planet? Does that also mean that a simple int(...) scans all possible number representations of all languages? That would be a severe performance issue, I think... Commented Aug 14 at 13:16
  • They're both in the Decimal Number class - Tai Tham Tham Digit Zero and Balinese Digit Two - so int('᪐᭒') is effectively int('02') -> 2. Commented Aug 14 at 13:18
  • "Nobody" uses the byte representation, but we can derive the code points from it. We usually use the U+xxxx form, which is also easier to look up (e.g. on Wikipedia); you can get it with [f"U+{ord(c):04X}" for c in '᪐᭒'] Commented Aug 14 at 13:19
  • Ok, I added the U representation for completeness. Commented Aug 14 at 13:32

1 Answer


...the string does not contain a number.

The two characters you show are numbers; they're both in the decimal number class, and per Python's documentation:

The values 0–9 can be represented by any Unicode decimal digit.
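That documented behaviour is easy to check with digits from other scripts (Arabic-Indic and fullwidth digits here, chosen purely as examples):

```python
# Any Unicode decimal digit is accepted, not just ASCII '0'-'9'
print(int('٤٢'))    # Arabic-Indic digits four, two -> 42
print(int('４２'))  # fullwidth digits four, two    -> 42
```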

Specifically:

  • ᪐ is U+1A90 TAI THAM THAM DIGIT ZERO, with digit value 0
  • ᭒ is U+1B52 BALINESE DIGIT TWO, with digit value 2

So int('᪐᭒') is effectively int('02'), which is indeed 2.
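You can confirm the digit values with the stdlib unicodedata module (a quick check against the Unicode database, not how int() itself is implemented):

```python
import unicodedata

s = '᪐᭒'
# Map each character to its decimal digit value per the Unicode database
digits = ''.join(str(unicodedata.decimal(c)) for c in s)
print(digits)           # '02'
print(int(s), int(digits))  # both 2
```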


Does that also mean that a simple int(...) [scans] all possible number representations of all languages?

In CPython, creating an int from a string goes through PyLong_FromUnicodeObject, which first calls _PyUnicode_TransformDecimalAndSpaceToASCII, a function which:

Converts a Unicode object holding a decimal value to an ASCII string for using in int, float and complex parsers.
Transforms code points that have decimal digit property to the corresponding ASCII digit code points.
Transforms spaces to ASCII.
Transforms code points starting from the first non-ASCII code point that is neither a decimal digit nor a space to the end into '?'.
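A rough Python sketch of that transformation (simplified: the real C function replaces everything from the first problematic non-ASCII code point to the end with '?', whereas this version substitutes per character):

```python
import unicodedata

def transform_decimal_and_space_to_ascii(s: str) -> str:
    """Simplified sketch of _PyUnicode_TransformDecimalAndSpaceToASCII."""
    out = []
    for c in s:
        if c.isspace():
            out.append(' ')                       # spaces become ASCII space
        elif (d := unicodedata.decimal(c, None)) is not None:
            out.append(str(d))                    # decimal digits become ASCII digits
        elif ord(c) < 128:
            out.append(c)                         # ASCII passes through unchanged
        else:
            out.append('?')                       # anything else poisons the parse
    return ''.join(out)

print(transform_decimal_and_space_to_ascii('᪐᭒'))  # '02'
```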

So you're getting something a little like:

>>> import unicodedata
>>> [unicodedata.digit(c, "?") for c in "᪐᭒"]
[0, 2]

prior to being parsed in the specified base. So it's not so much "scans all possible number representations" as a per-character lookup: each character of the input is checked to see whether it's considered a digit, and if it is, its Unicode properties include which digit it represents.


3 Comments

compart.com (your choice of link) doesn't provide an image of a glyph for either of those Unicode characters. codepoints.net (and others) provides an image, which lets you see what it actually looks like, without the font installed.
I see that this answer solves my question. However, I'm really shocked by the performance impact of this when hundreds of digit representations are considered instead of just 10. Of course, this is a decision made by the Python developers and not the application developers.
this is one of the things that makes Python 3.x slower than 2.x, yes. "it's slower because all strings are Unicode" is the explain-like-I'm-five version. "there are hundreds of representations of digits" explains one way in which all strings being Unicode makes Python 3.x slower. In practice, the Unicode digit codepoints are going to be stored in a data structure that makes testing for digits as performant as possible, but it's still going to be slower than 48 <= ord(c) < 58
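To illustrate the trade-off described above, here is a hypothetical ASCII-only check of the kind that comment mentions, next to the Unicode-aware behaviour:

```python
import unicodedata

def is_ascii_digit(c: str) -> bool:
    # The narrow, fast test: only ASCII '0'-'9' count
    return 48 <= ord(c) < 58

print(is_ascii_digit('2'), is_ascii_digit('᭒'))   # True False
print('᭒'.isdigit(), unicodedata.digit('᭒'))      # True 2
```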
