Support unicode array type. by frank-king · Pull Request #2896 · RustPython/RustPython

frank-king · 2021-08-16T12:38:08Z

youknowone

wow, finally we will get unicode array! great!

coolreader18 · 2021-08-16T23:41:25Z

One thing that's tricky about the 'u' array is that it looks like it's encoded as either UTF-32 or UTF-16 depending on the size of wchar_t: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L3251-L3274

One code point could be split over 2 WideChars in the array if its value is > u16::MAX, and as the code is now it just silently truncates the high bits of the value with the ch as _

frank-king · 2021-08-17T01:22:51Z

One thing that's tricky about the 'u' array is that it looks like it's encoded as either UTF-32 or UTF-16 depending on the size of wchar_t: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L3251-L3274

One code point could be split over 2 WideChars in the array if its value is > u16::MAX, and as the code is now it just silently truncates the high bits of the value with the ch as _

From the CPython's code, I am a little confused. Should SIZEOF_WCHAR_T == sizeof(wchar_t) always be true?

My understanding is that sizeof(wchar_t) and SIZEOF_WCHAR_T can be either 2 or 4 depending on machines, and the cases are

match (SIZEOF_WCHAR_T, sizeof(wchar_t)) {
    (2, 2) | (4, 4) =>  /* case [1] */,
    (4, 2) => /* case [2]  */
    (2, 4) => /* case [3] */
}

In RustPython, how can we deal with cases that SIZEOF_WCHAR_T != sizeof(wchar_t)? We do not have a macro SIZEOF_WCHAR_T and only have sizeof(wchar_t) from libc.

[1] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3230-L3241

#if USE_UNICODE_WCHAR_CACHE
    const wchar_t *wstr = _PyUnicode_WSTR(unicode);
    if (wstr != NULL) {
        memcpy(w, wstr, size * sizeof(wchar_t));
        return;
    }
#else /* USE_UNICODE_WCHAR_CACHE */
    if (PyUnicode_KIND(unicode) == sizeof(wchar_t)) {
        memcpy(w, PyUnicode_DATA(unicode), size * sizeof(wchar_t));
        return;
    }
#endif /* USE_UNICODE_WCHAR_CACHE *

[2] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3252-L3256

        assert(PyUnicode_KIND(unicode) == PyUnicode_2BYTE_KIND);
        const Py_UCS2 *s = PyUnicode_2BYTE_DATA(unicode);
        for (; size--; ++s, ++w) {
            *w = *s;
        }

[3] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3258-L3273

        assert(PyUnicode_KIND(unicode) == PyUnicode_4BYTE_KIND);
        const Py_UCS4 *s = PyUnicode_4BYTE_DATA(unicode);
        for (; size--; ++s, ++w) {
            Py_UCS4 ch = *s;
            if (ch > 0xFFFF) {
                assert(ch <= MAX_UNICODE);
                /* encode surrogate pair in this case */
                *w++ = Py_UNICODE_HIGH_SURROGATE(ch);
                if (!size--)
                    break;
                *w = Py_UNICODE_LOW_SURROGATE(ch);
            }
            else {
                *w = ch;
            }
        }

coolreader18 · 2021-08-17T04:08:05Z

I don't think it does check sizeof(wchar_t), and SIZEOF_WCHAR_T I believe should always be equivalent - maybe what you're confused on is a missing case, which was from the outer function which calls unicode_copy_as_widechar: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L4235-L4267

CPython's string repr is kinda weird, so it's just checking whether the string is internally a Vec<u8>, Vec<u16>, or Vec<u32>. When they're converting to a wide string, it's obviously easy if the string is a Vec<wchar_t>, and a Vec<u8> can just be casted up from u8 -> wchar_t. The special case is converting from u16 to wchar_t if wchar_t == u32 or u32 to wchar_t if wchar_t == u16. The first one is also just an upcasting, the second requires converting to UTF-16.

coolreader18 · 2021-08-17T04:13:21Z

Oh hey it'd actually be not too bad to do UTF-16 - char::decode_utf16 and str::encode_utf16/char::encode_utf16 seem like they'd make this pretty painless.

frank-king · 2021-08-17T09:13:56Z

@coolreader18 Thanks for your explanation, and that's what I've learnt from the CPython's code:

In CPython, a string can be either encoded in UTF-8, USC-1, USC-2 or USC-4 depends on its content.

However, in RustPython, a string is always encoded in UTF-8. (as the PyStr::value's type is Box<str>). So we need no conversions inside the unicode array (Vec<wchar_t>). We only need conversions during initialization (from Box<str>) or the explicit call of fromunicode/tounicode (where the arguments are Box<str>).

I have another question:

If I call 'a.insert(-1, '\U0001FF00') where a is an array of unicode and sizeof(wchar_t) == 2, should we throw exception (and what kind of exception) or just truncate the value from the lower 2 bytes?

I have no such environment that sizeof(wchar_t) == 2 so I cannot test how CPython will behavior. Do you have any ideas?

frank-king · 2021-08-17T10:00:56Z

Oh hey it'd actually be not too bad to do UTF-16 - char::decode_utf16 and str::encode_utf16/char::encode_utf16 seem like they'd make this pretty painless.

Note that CPython splits a USC-4 to two USC-2s, which is different from the behavior of char::encode_utf16. We may not use this function.

Maybe we can add a function like this?

fn to_usc16(ch: char) -> impl Iterator<Item = u16> {
    let ch = ch as u32;
    let mut usc16 = [0_u16; 2];
    if ch > u16::MAX as u32 {
        usc16[0] = (ch >> 16) as u16;
        usc16[1] = ch as u16;
        usc16.into_iter()
    } else {
        usc16[0] = ch as u16;
        usc16[..1].into_iter()
    }
}

Update: my mistake, Py_UNICODE_HIGH_SURROGATE does UTF-16 convertion.

frank-king · 2021-08-17T14:09:04Z

Now all the test cases in UnicodeTest have passed :)

frank-king · 2021-08-17T15:25:32Z

Fix as the clippy tips.

youknowone

Looks great in general. I left a few comments

frank-king · 2021-08-21T00:53:52Z

        self.assertEqual(fp.geturl(), redirected_url.strip())

+    # TODO: RUSTPYTHON
+    @unittest.expectedFailure


There are two tests failed on my computer, so I marked them expectedFailure:

====================================================================== FAIL: test_redirect_encoding (test.test_urllib2.HandlerTests) [b'/spaced path/'] ---------------------------------------------------------------------- Traceback (most recent call last): File ".../vm/pylib-crate/Lib/test/test_urllib2.py", line 1358, in test_redirect_encoding self.assertTrue(request.startswith(expected), repr(request)) AssertionError: False is not true : b'GET http://example.com/spaced%20path/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nUser-Agent: Python-urllib/3.9\r\nConnection: close\r\n\r\n' ====================================================================== FAIL: test_redirect_encoding (test.test_urllib2.HandlerTests) [b'/?p\xc3\xa5-dansk'] ---------------------------------------------------------------------- Traceback (most recent call last): File ".../vm/pylib-crate/Lib/test/test_urllib2.py", line 1358, in test_redirect_encoding self.assertTrue(request.startswith(expected), repr(request)) AssertionError: False is not true : b'GET http://example.com/?p%C3%A5-dansk HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nUser-Agent: Python-urllib/3.9\r\nConnection: close\r\n\r\n'

But it seems to pass on the GitHub CI.

My OS is Ubuntu 20.04, and the CPU is 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz.

My rustc version is 1.53.0, and the test command is

cargo run --release -- -m test -j8 -v

Is there anyone who can reproduce this error?

youknowone · 2021-08-21T13:26:27Z

Thank you for contributing!

frank-king force-pushed the support_unicode_array branch from 5f660a4 to 4ddcda1 Compare August 16, 2021 12:40

youknowone requested changes Aug 16, 2021

View reviewed changes

Comment thread vm/src/builtins/pystr.rs Outdated

Comment thread vm/src/stdlib/array.rs Outdated

Comment thread vm/src/stdlib/array.rs Outdated

frank-king force-pushed the support_unicode_array branch from 4ddcda1 to c4dcf8a Compare August 16, 2021 15:46

frank-king requested a review from youknowone August 16, 2021 15:47

frank-king force-pushed the support_unicode_array branch from c4dcf8a to 4d96582 Compare August 17, 2021 14:08

frank-king force-pushed the support_unicode_array branch 2 times, most recently from 893075a to 6c8a6e8 Compare August 17, 2021 15:25

youknowone reviewed Aug 17, 2021

View reviewed changes

Comment thread vm/src/exceptions.rs Outdated

youknowone requested changes Aug 17, 2021

View reviewed changes

Comment thread vm/src/stdlib/array.rs

Comment thread vm/src/stdlib/array.rs Outdated

frank-king commented Aug 17, 2021

View reviewed changes

Comment thread vm/src/builtins/memory.rs Outdated

frank-king force-pushed the support_unicode_array branch from 6c8a6e8 to f283a2b Compare August 20, 2021 16:24

frank-king requested a review from youknowone August 20, 2021 16:29

frank-king force-pushed the support_unicode_array branch from f283a2b to 1db04df Compare August 20, 2021 16:32

frank-king commented Aug 21, 2021

View reviewed changes

Comment thread vm/src/stdlib/array.rs Outdated

frank-king force-pushed the support_unicode_array branch from 1db04df to c7a193e Compare August 21, 2021 10:21

Support unicode array type.

e652ae8

frank-king force-pushed the support_unicode_array branch from c7a193e to e652ae8 Compare August 21, 2021 12:18

youknowone approved these changes Aug 21, 2021

View reviewed changes

youknowone merged commit 7868070 into RustPython:master Aug 21, 2021

frank-king deleted the support_unicode_array branch August 21, 2021 15:13

Conversation

frank-king commented Aug 16, 2021

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coolreader18 commented Aug 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

frank-king commented Aug 17, 2021

Uh oh!

coolreader18 commented Aug 17, 2021

Uh oh!

coolreader18 commented Aug 17, 2021

Uh oh!

frank-king commented Aug 17, 2021

Uh oh!

frank-king commented Aug 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

frank-king commented Aug 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

frank-king commented Aug 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

frank-king Aug 21, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

youknowone commented Aug 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coolreader18 commented Aug 16, 2021 •

edited

Loading

frank-king commented Aug 17, 2021 •

edited

Loading

frank-king commented Aug 17, 2021 •

edited

Loading

frank-king commented Aug 17, 2021 •

edited

Loading