Skip to content

Support unicode array type.#2896

Merged
youknowone merged 1 commit into
RustPython:masterfrom
frank-king:support_unicode_array
Aug 21, 2021
Merged

Support unicode array type.#2896
youknowone merged 1 commit into
RustPython:masterfrom
frank-king:support_unicode_array

Conversation

@frank-king

Copy link
Copy Markdown
Contributor

fix #2895

@frank-king frank-king force-pushed the support_unicode_array branch from 5f660a4 to 4ddcda1 Compare August 16, 2021 12:40

@youknowone youknowone left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, finally we will get unicode array! great!

Comment thread vm/src/builtins/pystr.rs Outdated
Comment thread vm/src/stdlib/array.rs Outdated
Comment thread vm/src/stdlib/array.rs Outdated
@frank-king frank-king force-pushed the support_unicode_array branch from 4ddcda1 to c4dcf8a Compare August 16, 2021 15:46
@frank-king frank-king requested a review from youknowone August 16, 2021 15:47
@coolreader18

coolreader18 commented Aug 16, 2021

Copy link
Copy Markdown
Member

One thing that's tricky about the 'u' array is that it looks like it's encoded as either UTF-32 or UTF-16 depending on the size of wchar_t: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L3251-L3274

One code point could be split over 2 WideChars in the array if its value is > u16::MAX, and as the code is now it just silently truncates the high bits of the value with the ch as _

@frank-king

Copy link
Copy Markdown
Contributor Author

One thing that's tricky about the 'u' array is that it looks like it's encoded as either UTF-32 or UTF-16 depending on the size of wchar_t: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L3251-L3274

One code point could be split over 2 WideChars in the array if its value is > u16::MAX, and as the code is now it just silently truncates the high bits of the value with the ch as _

From the CPython's code, I am a little confused. Should SIZEOF_WCHAR_T == sizeof(wchar_t) always be true?

My understanding is that sizeof(wchar_t) and SIZEOF_WCHAR_T can be either 2 or 4 depending on machines, and the cases are

match (SIZEOF_WCHAR_T, sizeof(wchar_t)) {
    (2, 2) | (4, 4) =>  /* case [1] */,
    (4, 2) => /* case [2]  */
    (2, 4) => /* case [3] */
}

In RustPython, how can we deal with cases that SIZEOF_WCHAR_T != sizeof(wchar_t)? We do not have a macro SIZEOF_WCHAR_T and only have sizeof(wchar_t) from libc.

[1] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3230-L3241

#if USE_UNICODE_WCHAR_CACHE
    const wchar_t *wstr = _PyUnicode_WSTR(unicode);
    if (wstr != NULL) {
        memcpy(w, wstr, size * sizeof(wchar_t));
        return;
    }
#else /* USE_UNICODE_WCHAR_CACHE */
    if (PyUnicode_KIND(unicode) == sizeof(wchar_t)) {
        memcpy(w, PyUnicode_DATA(unicode), size * sizeof(wchar_t));
        return;
    }
#endif /* USE_UNICODE_WCHAR_CACHE *

[2] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3252-L3256

        assert(PyUnicode_KIND(unicode) == PyUnicode_2BYTE_KIND);
        const Py_UCS2 *s = PyUnicode_2BYTE_DATA(unicode);
        for (; size--; ++s, ++w) {
            *w = *s;
        }

[3] https://github.com/python/cpython/blob/1512bc21d60f098a9e9f37b44a2f6a9b49a3fd4f/Objects/unicodeobject.c#L3258-L3273

        assert(PyUnicode_KIND(unicode) == PyUnicode_4BYTE_KIND);
        const Py_UCS4 *s = PyUnicode_4BYTE_DATA(unicode);
        for (; size--; ++s, ++w) {
            Py_UCS4 ch = *s;
            if (ch > 0xFFFF) {
                assert(ch <= MAX_UNICODE);
                /* encode surrogate pair in this case */
                *w++ = Py_UNICODE_HIGH_SURROGATE(ch);
                if (!size--)
                    break;
                *w = Py_UNICODE_LOW_SURROGATE(ch);
            }
            else {
                *w = ch;
            }
        }

@coolreader18

Copy link
Copy Markdown
Member

I don't think it does check sizeof(wchar_t), and SIZEOF_WCHAR_T I believe should always be equivalent - maybe what you're confused on is a missing case, which was from the outer function which calls unicode_copy_as_widechar: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L4235-L4267

CPython's string repr is kinda weird, so it's just checking whether the string is internally a Vec<u8>, Vec<u16>, or Vec<u32>. When they're converting to a wide string, it's obviously easy if the string is a Vec<wchar_t>, and a Vec<u8> can just be casted up from u8 -> wchar_t. The special case is converting from u16 to wchar_t if wchar_t == u32 or u32 to wchar_t if wchar_t == u16. The first one is also just an upcasting, the second requires converting to UTF-16.

@coolreader18

Copy link
Copy Markdown
Member

Oh hey it'd actually be not too bad to do UTF-16 - char::decode_utf16 and str::encode_utf16/char::encode_utf16 seem like they'd make this pretty painless.

@frank-king

Copy link
Copy Markdown
Contributor Author

@coolreader18 Thanks for your explanation, and that's what I've learnt from the CPython's code:

In CPython, a string can be either encoded in UTF-8, USC-1, USC-2 or USC-4 depends on its content.

However, in RustPython, a string is always encoded in UTF-8. (as the PyStr::value's type is Box<str>). So we need no conversions inside the unicode array (Vec<wchar_t>). We only need conversions during initialization (from Box<str>) or the explicit call of fromunicode/tounicode (where the arguments are Box<str>).

I have another question:

If I call 'a.insert(-1, '\U0001FF00') where a is an array of unicode and sizeof(wchar_t) == 2, should we throw exception (and what kind of exception) or just truncate the value from the lower 2 bytes?

I have no such environment that sizeof(wchar_t) == 2 so I cannot test how CPython will behavior. Do you have any ideas?

@frank-king

frank-king commented Aug 17, 2021

Copy link
Copy Markdown
Contributor Author

Oh hey it'd actually be not too bad to do UTF-16 - char::decode_utf16 and str::encode_utf16/char::encode_utf16 seem like they'd make this pretty painless.

Note that CPython splits a USC-4 to two USC-2s, which is different from the behavior of char::encode_utf16. We may not use this function.

Maybe we can add a function like this?

fn to_usc16(ch: char) -> impl Iterator<Item = u16> {
    let ch = ch as u32;
    let mut usc16 = [0_u16; 2];
    if ch > u16::MAX as u32 {
        usc16[0] = (ch >> 16) as u16;
        usc16[1] = ch as u16;
        usc16.into_iter()
    } else {
        usc16[0] = ch as u16;
        usc16[..1].into_iter()
    }
}

Update: my mistake, Py_UNICODE_HIGH_SURROGATE does UTF-16 convertion.

@frank-king frank-king force-pushed the support_unicode_array branch from c4dcf8a to 4d96582 Compare August 17, 2021 14:08
@frank-king

frank-king commented Aug 17, 2021

Copy link
Copy Markdown
Contributor Author

Now all the test cases in UnicodeTest have passed :)

@frank-king frank-king force-pushed the support_unicode_array branch 2 times, most recently from 893075a to 6c8a6e8 Compare August 17, 2021 15:25
@frank-king

frank-king commented Aug 17, 2021

Copy link
Copy Markdown
Contributor Author

Fix as the clippy tips.

Comment thread vm/src/exceptions.rs Outdated

@youknowone youknowone left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great in general. I left a few comments

Comment thread vm/src/stdlib/array.rs
Comment thread vm/src/stdlib/array.rs Outdated
Comment thread vm/src/builtins/memory.rs Outdated
@frank-king frank-king force-pushed the support_unicode_array branch from 6c8a6e8 to f283a2b Compare August 20, 2021 16:24
@frank-king frank-king requested a review from youknowone August 20, 2021 16:29
@frank-king frank-king force-pushed the support_unicode_array branch from f283a2b to 1db04df Compare August 20, 2021 16:32
Comment thread Lib/test/test_urllib2.py Outdated
self.assertEqual(fp.geturl(), redirected_url.strip())

# TODO: RUSTPYTHON
@unittest.expectedFailure

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two tests failed on my computer, so I marked them expectedFailure:

======================================================================
FAIL: test_redirect_encoding (test.test_urllib2.HandlerTests) [b'/spaced path/']
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../vm/pylib-crate/Lib/test/test_urllib2.py", line 1358, in test_redirect_encoding
    self.assertTrue(request.startswith(expected), repr(request))
AssertionError: False is not true : b'GET http://example.com/spaced%20path/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nUser-Agent: Python-urllib/3.9\r\nConnection: close\r\n\r\n'

======================================================================
FAIL: test_redirect_encoding (test.test_urllib2.HandlerTests) [b'/?p\xc3\xa5-dansk']
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../vm/pylib-crate/Lib/test/test_urllib2.py", line 1358, in test_redirect_encoding
    self.assertTrue(request.startswith(expected), repr(request))
AssertionError: False is not true : b'GET http://example.com/?p%C3%A5-dansk HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nUser-Agent: Python-urllib/3.9\r\nConnection: close\r\n\r\n'

But it seems to pass on the GitHub CI.

My OS is Ubuntu 20.04, and the CPU is 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz.

My rustc version is 1.53.0, and the test command is

cargo run --release -- -m test -j8 -v

Is there anyone who can reproduce this error?

Comment thread vm/src/stdlib/array.rs Outdated
@frank-king frank-king force-pushed the support_unicode_array branch from 1db04df to c7a193e Compare August 21, 2021 10:21
@frank-king frank-king force-pushed the support_unicode_array branch from c7a193e to e652ae8 Compare August 21, 2021 12:18
@youknowone youknowone merged commit 7868070 into RustPython:master Aug 21, 2021
@youknowone

Copy link
Copy Markdown
Member

Thank you for contributing!

@frank-king frank-king deleted the support_unicode_array branch August 21, 2021 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

String index not correct

3 participants