-
Notifications
You must be signed in to change notification settings - Fork 465
improve text encoder encode performance #5448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #5448 will improve performances by ×7.4Comparing Summary
Benchmarks breakdown
Footnotes
|
30c0921 to
2259fce
Compare
|
Looks like there are still some bugs to work out here with the failing CI... but even then, my preference would be to settle on #5449 first before landing any changes here. Also, since we do utf8 conversions everywhere, not just in TextEncoder::encode, my preference would be to address this more generally. That is, the optimized encoding path -- if we decide it's worthwhile -- should go into |
b1ef7c8 to
98afb46
Compare
98afb46 to
f1bbfe6
Compare
7fac631 to
681bf71
Compare
dfb1c39 to
3a6ea76
Compare
3a6ea76 to
6e3972e
Compare
|
I don't think we need to speed optimize for the broken UTF-16 case unless and until someone shows it matters. The only reason to space-optimize would be to avoid throwing OOM, so if we can guarantee that doesn't happen I'm OK with some temporary blowup in space too. |
|
This helps a lot. I'll push with your changes. Thanks @erikcorry |
src/workerd/api/encoding.c++
Outdated
|
|
||
| // Calculate UTF-8 length from UTF-16 with potentially invalid surrogates. | ||
| // Invalid surrogates are counted as U+FFFD (3 bytes in UTF-8). | ||
| size_t utf8LengthFromInvalidUtf16(const char16_t* input, size_t length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned this in another comment that I believe got resolved somewhere... this really does not belong in encoding.c++. Any improvement we make should be usable by the entire runtime. Let's move the optimization into jsg::JsString so that all of the APIs benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this last. I want to make sure we don't regress with your recommendations first.
| KJ_DASSERT(utf8_length == written); | ||
| auto array = | ||
| v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore2), 0, utf8_length); | ||
| return jsg::JsUint8Array(array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still creating the intermediate on-heap allocation that can trigger GC. Let's have the intermediate allocation (the one that writeOneByte writes into) be an off-heap (kj::SmallArray<kj::byte, 4096> buffer(length)) allocation. In both cases we end up copying into the on-heap destination buffer but we avoid the wasted additional on-heap allocation in the second case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this, on happy path (length === utf8_length), we do a memcpy here. I'm not sure if this is better. @erikcorry what do you think?
if (utf8_length == length) {
// ASCII fast path: copy to on-heap BackingStore once
backingStore = js.allocBackingStore(length, jsg::Lock::AllocOption::UNINITIALIZED);
memcpy(backingStore->Data(), buffer.begin(), length);
auto array = v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore), 0, length);
return jsg::JsUint8Array(array);
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about the memcpy... on line 601 you allocate a v8::BackingStore that receives the results of the WriteOneByte, then here in this branch you create a second one that receives the results of the latin1-to-utf8 encoding. What I'm saying here is that you should only create one v8::BackingStore that represents the final result. The WriteOneByte should write into an off-v8-heap allocation (e.g. a kj::Array). This way you greatly lessen the risk of triggering a GC in the middle.
| // Two-byte string path | ||
| { | ||
| // Note that ValueView flattens the string, if it's not already flattened | ||
| v8::String::ValueView view(js.v8Isolate, str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that ValueView forces flattening, which we need to avoid... let's instead use WriteV2 to write to an off-heap allocation (e.g. kj::SmallArray<uint16_t, 4096> buffer(length)) to avoid the additional GC pressure caused by flattening. In this case, it's perfectly fine to give up some CPU for the copy in order to avoid the higher cost of flattening.
| if (str.isOneByte(js)) { | ||
| auto length = str.length(js); | ||
| // Allocate buffer for Latin-1. Use v8::ArrayBuffer::NewBackingStore to avoid creating | ||
| // JS objects during conversion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, while this may not create JS objects it can and will trigger GC.. so let's try to do this (allocating the on-heap result buffer) once per call. All other allocations should be off-heap.
src/workerd/api/encoding.c++
Outdated
|
|
||
| // Convert UTF-16 with potentially invalid surrogates to UTF-8. | ||
| // Invalid surrogates are replaced with U+FFFD. | ||
| void convertInvalidUtf16ToUtf8(const char16_t* input, size_t length, char* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to simplify things further by combining this and the utf8LengthFromInvalidUtf16 function into one. Return the ut8 length, but also replace the invalid code units with the replacement char in-place, then apply the encoding which, if the scan was correct, should never have to worry about the unpaired surrogates. It would save an additional linear scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work because we need to have a separate copy in order to update it. The original input is a const char*. we can't fix and calculate length on 1 linear scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original input does not need to be const tho, that's under our control. If you modify it like this then you can simplify the next transcoding step since you'll know the unpaired surrogates have already been replaced and you don't need to handle them again:
size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<char16_t> input) {
size_t utf8Length = 0;
bool pendingSurrogate = false;
for (size_t i = 0; i < input.size(); i++) {
char16_t c = input[i];
if (pendingSurrogate) {
if (isTrailSurrogate(c)) {
// Valid surrogate pair = 4 bytes in UTF-8
utf8Length += 4;
pendingSurrogate = false;
} else {
// Unpaired lead surrogate at position i-1 - replace it
input[i - 1] = 0xFFFD;
utf8Length += 3; // U+FFFD is always 3 bytes in UTF-8
// Now process current character (which is not a trail surrogate)
if (isLeadSurrogate(c)) {
// Current char becomes the new pending lead surrogate
pendingSurrogate = true;
} else {
// Regular character
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
}
}
} else if (isLeadSurrogate(c)) {
pendingSurrogate = true;
} else if (isTrailSurrogate(c)) {
// Unpaired trail surrogate - replace it
input[i] = 0xFFFD;
utf8Length += 3; // U+FFFD is always 3 bytes in UTF-8
} else {
utf8Length += utf8BytesForCodeUnit(c);
}
}
if (pendingSurrogate) {
// Trailing unpaired lead surrogate - replace it
input[input.size() - 1] = 0xFFFD;
utf8Length += 3; // U+FFFD is always 3 bytes in UTF-8
}
return utf8Length;
}|
|
||
| int hashCode() const; | ||
|
|
||
| bool isOneByte(Lock& js) const KJ_WARN_UNUSED_RESULT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: isOneByte is super cheap. The KJ_WARN_UNUSED_RESULT is probably unnecessary.
| return BackingStore(js.allocBackingStore(size), size, 0, getBufferSourceElementSize<T>(), | ||
| construct<T>, checkIsIntegerType<T>()); | ||
| static BackingStore alloc( | ||
| Lock& js, size_t size, Lock::AllocOption init_mode = Lock::AllocOption::ZERO_INITIALIZED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not a big fan of adding the init_mode option here. My preference would be to stay on the safer side and always have these allocations be zero-initialized as it makes it significantly less likely that we'll mess up and introduce a critical memory vuln. If we add this we need to be extra careful about how we use this.
| backingStore = js.allocBackingStore(length, jsg::Lock::AllocOption::UNINITIALIZED); | ||
| auto backingData = reinterpret_cast<kj::byte*>(backingStore->Data()); | ||
|
|
||
| str.writeOneByte(js, kj::arrayPtr(backingData, length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply because I'm paranoid, since you're adding the uninitialized allocation, we should probably have an assert in here that verifies that writeOneByte completely overwrote the backing buffer.
| // ASCII fast path: no conversion needed, Latin-1 is same as UTF-8 for ASCII | ||
| auto array = v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore), 0, length); | ||
| return jsg::JsUint8Array(array); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be safe, a KJ_DASSERT that verifies that utf8_length > length would be good here.
| auto data = reinterpret_cast<const char16_t*>(view.data16()); | ||
|
|
||
| if (simdutf::validate_utf16le(data, view.length())) { | ||
| // Common case: valid UTF-16, convert directly to UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps link to simdutf/simdutf#688 in comment?
| if (pendingSurrogate) { | ||
| if (isTrailSurrogate(c)) { | ||
| // Valid surrogate pair = 4 bytes in UTF-8 | ||
| utf8Length += 4; | ||
| pendingSurrogate = false; | ||
| } else { | ||
| // Unpaired lead surrogate = U+FFFD (3 bytes) | ||
| utf8Length += 3; | ||
| if (!isLeadSurrogate(c)) { | ||
| utf8Length += utf8BytesForCodeUnit(c); | ||
| pendingSurrogate = false; | ||
| } | ||
| } | ||
| } else if (isLeadSurrogate(c)) { | ||
| pendingSurrogate = true; | ||
| } else { | ||
| if (isTrailSurrogate(c)) { | ||
| // Unpaired trail surrogate = U+FFFD (3 bytes) | ||
| utf8Length += 3; | ||
| } else { | ||
| utf8Length += utf8BytesForCodeUnit(c); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (pendingSurrogate) { | ||
| utf8Length += 3; // Trailing unpaired lead surrogate | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not an actual suggestion)
Just noting that this whole logic should be identical to the following:
| if (pendingSurrogate) { | |
| if (isTrailSurrogate(c)) { | |
| // Valid surrogate pair = 4 bytes in UTF-8 | |
| utf8Length += 4; | |
| pendingSurrogate = false; | |
| } else { | |
| // Unpaired lead surrogate = U+FFFD (3 bytes) | |
| utf8Length += 3; | |
| if (!isLeadSurrogate(c)) { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| pendingSurrogate = false; | |
| } | |
| } | |
| } else if (isLeadSurrogate(c)) { | |
| pendingSurrogate = true; | |
| } else { | |
| if (isTrailSurrogate(c)) { | |
| // Unpaired trail surrogate = U+FFFD (3 bytes) | |
| utf8Length += 3; | |
| } else { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| } | |
| } | |
| } | |
| if (pendingSurrogate) { | |
| utf8Length += 3; // Trailing unpaired lead surrogate | |
| } | |
| if (isLeadSurrogate(c)) { | |
| utf8Length += 3; | |
| pendingSurrogate = true; | |
| } else if (isTrailSurrogate(c)) { | |
| utf8Length += pendingSurrogate ? 1 : 3; | |
| pendingSurrogate = false; | |
| } else { | |
| utf8Length += utf8BytesForCodeUnit(c); | |
| pendingSurrogate = false; | |
| } | |
| } |
Not sure which is more performant (or if that is significant at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps some form of:
size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<const char16_t> input) {
size_t utf8Length = 0;
bool pendingSurrogate = false;
for (size_t i = 0; i < input.size(); i++) {
char16_t c = input[i];
if (c < 0xD800 || c > 0xDFFF) {
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
} else if (c < 0xDC00) {
// Lead surrogate
utf8Length += 3;
pendingSurrogate = true;
} else {
// Trail surrogate
utf8Length += pendingSurrogate ? 1 : 3;
pendingSurrogate = false;
}
}
return utf8Length;
}
small experiment with v8::String::ValueView and simdutf for TextEncoder::encode method.