Skip to content

Conversation

@anonrig
Copy link
Member

@anonrig anonrig commented Oct 31, 2025

small experiment with v8::String::ValueView and simdutf for TextEncoder::encode method.

@anonrig anonrig requested review from a team as code owners October 31, 2025 14:50
@codspeed-hq
Copy link

codspeed-hq bot commented Oct 31, 2025

CodSpeed Performance Report

Merging #5448 will improve performances by ×7.4

Comparing yagiz/experiment-value-view (cbe01b6) with main (df1cafa)

Summary

⚡ 11 improvements
✅ 23 untouched
⏩ 9 skipped1

Benchmarks breakdown

Benchmark BASE HEAD Change
simpleStringBody[Response] 22.4 µs 19.9 µs +12.68%
Encode_ASCII_1024[0/0/1024] 8.9 ms 3 ms ×3
Encode_ASCII_256[0/0/256] 4.4 ms 2.5 ms +77.75%
Encode_ASCII_32[0/0/32] 3.1 ms 2.3 ms +35.72%
Encode_ASCII_8192[0/0/8192] 55.6 ms 7.5 ms ×7.4
Encode_OneByte_1024[0/1/1024] 11.9 ms 4.8 ms ×2.5
Encode_OneByte_256[0/1/256] 5.1 ms 3.3 ms +54.64%
Encode_OneByte_8192[0/1/8192] 84.7 ms 18.7 ms ×4.5
Encode_TwoByte_1024[0/2/1024] 19.5 ms 5 ms ×3.9
Encode_TwoByte_256[0/2/256] 7 ms 3.2 ms ×2.2
Encode_TwoByte_8192[0/2/8192] 146.1 ms 23.5 ms ×6.2

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 30c0921 to 2259fce Compare October 31, 2025 17:18
@anonrig anonrig changed the title experiment with value view and simdutf improve text encoder encode performance Oct 31, 2025
@jasnell
Copy link
Collaborator

jasnell commented Nov 3, 2025

Looks like there are still some bugs to work out here with the failing CI... but even then, my preference would be to settle on #5449 first before landing any changes here. Also, since we do utf8 conversions everywhere, not just in TextEncoder::encode, my preference would be to address this more generally. That is, the optimized encoding path -- if we decide it's worthwhile -- should go into jsg::JsString so that it can be used everywhere rather than just in TextEncoder::encode.

@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from b1ef7c8 to 98afb46 Compare November 3, 2025 17:10
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 98afb46 to f1bbfe6 Compare November 3, 2025 17:15
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 7fac631 to 681bf71 Compare November 3, 2025 17:50
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch 2 times, most recently from dfb1c39 to 3a6ea76 Compare November 3, 2025 18:00
@anonrig anonrig force-pushed the yagiz/experiment-value-view branch from 3a6ea76 to 6e3972e Compare November 3, 2025 20:04
@erikcorry
Copy link
Contributor

I don't think we need to speed optimize for the broken UTF-16 case unless and until someone shows it matters. The only reason to space-optimize would be to avoid throwing OOM, so if we can guarantee that doesn't happen I'm OK with some temporary blowup in space too.

@erikcorry
Copy link
Contributor

https://paste.cfdata.org/GKKdyFGFqSks

@anonrig
Copy link
Member Author

anonrig commented Nov 3, 2025

https://paste.cfdata.org/GKKdyFGFqSks

This helps a lot. I'll push with your changes. Thanks @erikcorry


// Calculate UTF-8 length from UTF-16 with potentially invalid surrogates.
// Invalid surrogates are counted as U+FFFD (3 bytes in UTF-8).
size_t utf8LengthFromInvalidUtf16(const char16_t* input, size_t length) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this in another comment that I believe got resolved somewhere... this really does not belong in encoding.c++. Any improvement we make should be usable by the entire runtime. Let's move the optimization into jsg::JsString so that all of the APIs benefit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this last. I want to make sure we don't regress with your recommendations first.

KJ_DASSERT(utf8_length == written);
auto array =
v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore2), 0, utf8_length);
return jsg::JsUint8Array(array);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still creating the intermediate on-heap allocation that can trigger GC. Let's have the intermediate allocation (the one that writeOneByte writes into) be an off-heap (kj::SmallArray<kj::byte, 4096> buffer(length)) allocation. In both cases we end up copying into the on-heap destination buffer but we avoid the wasted additional on-heap allocation in the second case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this, on happy path (length === utf8_length), we do a memcpy here. I'm not sure if this is better. @erikcorry what do you think?

if (utf8_length == length) {
      // ASCII fast path: copy to on-heap BackingStore once
      backingStore = js.allocBackingStore(length, jsg::Lock::AllocOption::UNINITIALIZED);
      memcpy(backingStore->Data(), buffer.begin(), length);
      auto array = v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore), 0, length);
      return jsg::JsUint8Array(array);
    }
    ```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about the memcpy... on line 601 you allocate a v8::BackingStore that receives the results of the WriteOneByte, then here in this branch you create a second one that receives the results of the latin1-to-utf8 encoding. What I'm saying here is that you should only create one v8::BackingStore that represents the final result. The WriteOneByte should write into an off-v8-heap allocation (e.g. a kj::Array). This way you greatly lessen the risk of triggering a GC in the middle.

// Two-byte string path
{
// Note that ValueView flattens the string, if it's not already flattened
v8::String::ValueView view(js.v8Isolate, str);
Copy link
Collaborator

@jasnell jasnell Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that ValueView forces flattening, which we need to avoid... let's instead use WriteV2 to write to an off-heap allocation (e.g. kj::SmallArray<uint16_t, 4096> buffer(length)) to avoid the additional GC pressure caused by flattening. In this case, it's perfectly fine to give up some CPU for the copy in order to avoid the higher cost of flattening.

if (str.isOneByte(js)) {
auto length = str.length(js);
// Allocate buffer for Latin-1. Use v8::ArrayBuffer::NewBackingStore to avoid creating
// JS objects during conversion.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, while this may not create JS objects it can and will trigger GC.. so let's try to do this (allocating the on-heap result buffer) once per call. All other allocations should be off-heap.


// Convert UTF-16 with potentially invalid surrogates to UTF-8.
// Invalid surrogates are replaced with U+FFFD.
void convertInvalidUtf16ToUtf8(const char16_t* input, size_t length, char* out) {
Copy link
Collaborator

@jasnell jasnell Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to simplify things further by combining this and the utf8LengthFromInvalidUtf16 function into one. Return the ut8 length, but also replace the invalid code units with the replacement char in-place, then apply the encoding which, if the scan was correct, should never have to worry about the unpaired surrogates. It would save an additional linear scan.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work because we need to have a separate copy in order to update it. The original input is a const char*. we can't fix and calculate length on 1 linear scan.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original input does not need to be const tho, that's under our control. If you modify it like this then you can simplify the next transcoding step since you'll know the unpaired surrogates have already been replaced and you don't need to handle them again:

  size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<char16_t> input) {
    size_t utf8Length = 0;
    bool pendingSurrogate = false;

    for (size_t i = 0; i < input.size(); i++) {
      char16_t c = input[i];

      if (pendingSurrogate) {
        if (isTrailSurrogate(c)) {
          // Valid surrogate pair = 4 bytes in UTF-8
          utf8Length += 4;
          pendingSurrogate = false;
        } else {
          // Unpaired lead surrogate at position i-1 - replace it
          input[i - 1] = 0xFFFD;
          utf8Length += 3;  // U+FFFD is always 3 bytes in UTF-8

          // Now process current character (which is not a trail surrogate)
          if (isLeadSurrogate(c)) {
            // Current char becomes the new pending lead surrogate
            pendingSurrogate = true;
          } else {
            // Regular character
            utf8Length += utf8BytesForCodeUnit(c);
            pendingSurrogate = false;
          }
        }
      } else if (isLeadSurrogate(c)) {
        pendingSurrogate = true;
      } else if (isTrailSurrogate(c)) {
        // Unpaired trail surrogate - replace it
        input[i] = 0xFFFD;
        utf8Length += 3;  // U+FFFD is always 3 bytes in UTF-8
      } else {
        utf8Length += utf8BytesForCodeUnit(c);
      }
    }

    if (pendingSurrogate) {
      // Trailing unpaired lead surrogate - replace it
      input[input.size() - 1] = 0xFFFD;
      utf8Length += 3;  // U+FFFD is always 3 bytes in UTF-8
    }

    return utf8Length;
  }


int hashCode() const;

bool isOneByte(Lock& js) const KJ_WARN_UNUSED_RESULT;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: isOneByte is super cheap. The KJ_WARN_UNUSED_RESULT is probably unnecessary.

return BackingStore(js.allocBackingStore(size), size, 0, getBufferSourceElementSize<T>(),
construct<T>, checkIsIntegerType<T>());
static BackingStore alloc(
Lock& js, size_t size, Lock::AllocOption init_mode = Lock::AllocOption::ZERO_INITIALIZED) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not a big fan of adding the init_mode option here. My preference would be to stay on the safer side and always have these allocations be zero-initialized as it makes it significantly less likely that we'll mess up and introduce a critical memory vuln. If we add this we need to be extra careful about how we use this.

backingStore = js.allocBackingStore(length, jsg::Lock::AllocOption::UNINITIALIZED);
auto backingData = reinterpret_cast<kj::byte*>(backingStore->Data());

str.writeOneByte(js, kj::arrayPtr(backingData, length));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply because I'm paranoid, since you're adding the uninitialized allocation, we should probably have an assert in here that verifies that writeOneByte completely overwrote the backing buffer.

// ASCII fast path: no conversion needed, Latin-1 is same as UTF-8 for ASCII
auto array = v8::Uint8Array::New(v8::ArrayBuffer::New(js.v8Isolate, backingStore), 0, length);
return jsg::JsUint8Array(array);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be safe, a KJ_DASSERT that verifies that utf8_length > length would be good here.

auto data = reinterpret_cast<const char16_t*>(view.data16());

if (simdutf::validate_utf16le(data, view.length())) {
// Common case: valid UTF-16, convert directly to UTF-8
Copy link

@ChALkeR ChALkeR Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps link to simdutf/simdutf#688 in comment?

Comment on lines +490 to +517
if (pendingSurrogate) {
if (isTrailSurrogate(c)) {
// Valid surrogate pair = 4 bytes in UTF-8
utf8Length += 4;
pendingSurrogate = false;
} else {
// Unpaired lead surrogate = U+FFFD (3 bytes)
utf8Length += 3;
if (!isLeadSurrogate(c)) {
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
}
}
} else if (isLeadSurrogate(c)) {
pendingSurrogate = true;
} else {
if (isTrailSurrogate(c)) {
// Unpaired trail surrogate = U+FFFD (3 bytes)
utf8Length += 3;
} else {
utf8Length += utf8BytesForCodeUnit(c);
}
}
}

if (pendingSurrogate) {
utf8Length += 3; // Trailing unpaired lead surrogate
}
Copy link

@ChALkeR ChALkeR Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not an actual suggestion)
Just noting that this whole logic should be identical to the following:

Suggested change
if (pendingSurrogate) {
if (isTrailSurrogate(c)) {
// Valid surrogate pair = 4 bytes in UTF-8
utf8Length += 4;
pendingSurrogate = false;
} else {
// Unpaired lead surrogate = U+FFFD (3 bytes)
utf8Length += 3;
if (!isLeadSurrogate(c)) {
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
}
}
} else if (isLeadSurrogate(c)) {
pendingSurrogate = true;
} else {
if (isTrailSurrogate(c)) {
// Unpaired trail surrogate = U+FFFD (3 bytes)
utf8Length += 3;
} else {
utf8Length += utf8BytesForCodeUnit(c);
}
}
}
if (pendingSurrogate) {
utf8Length += 3; // Trailing unpaired lead surrogate
}
if (isLeadSurrogate(c)) {
utf8Length += 3;
pendingSurrogate = true;
} else if (isTrailSurrogate(c)) {
utf8Length += pendingSurrogate ? 1 : 3;
pendingSurrogate = false;
} else {
utf8Length += utf8BytesForCodeUnit(c);
pendingSurrogate = false;
}
}

Not sure which is more performant (or if that is significant at all)

Copy link

@ChALkeR ChALkeR Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps some form of:

size_t utf8LengthFromInvalidUtf16(kj::ArrayPtr<const char16_t> input) {
  size_t utf8Length = 0;
  bool pendingSurrogate = false;

  for (size_t i = 0; i < input.size(); i++) {
    char16_t c = input[i];

    if (c < 0xD800 || c > 0xDFFF) {
      utf8Length += utf8BytesForCodeUnit(c);
      pendingSurrogate = false;
    } else if (c < 0xDC00) {
      // Lead surrogate
      utf8Length += 3;
      pendingSurrogate = true;
    } else {
      // Trail surrogate
      utf8Length += pendingSurrogate ? 1 : 3;
      pendingSurrogate = false;
    }
  }

  return utf8Length;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants