optimise simdutf for short strings#926
Conversation
|
Benchmark for reference - cc: @lemire |
| #endif | ||
|
|
||
| #ifndef SIMDUTF_SHORT_INPUT_THRESH | ||
| #define SIMDUTF_SHORT_INPUT_THRESH 16 |
There was a problem hiding this comment.
how was this picked? it seems lower than both boundaries linked in #925
There was a problem hiding this comment.
When processing short inputs (e.g., strings smaller than 16 bytes),..
Sorry for that, I was considering @lemire's statement in the issue, will look into the referenced links. Thanks for pointing out.
|
@sleepingeight The design is clean, but I am not sure that this will work. The problem is that the function call is still not inline-able and most of our functions already branch on small inputs (often by necessity!!!) so according to my mental model, the branch you added in the source function buys you little and might even cost you a bit. simdutf_warn_unused bool validate_utf8(const char *buf, size_t len) noexcept {
if (len < SIMDUTF_SHORT_INPUT_THRESH) {
return scalar::utf8::validate(buf, len);
}
/*
* You may think that the get_default_implementation() is costly, but it is not. On many systems where
* runtime dispatching is unneeded (such as ARM), this is free. On x64, it is essentially just a guarded
* pointer access. If you run profiling data, I don't think it will come up.
* */
return get_default_implementation()->validate_utf8(buf, len);
}UPDATE (see below): In at least some cases, I am wrong and the approach pursued here appears to be justified. |
|
I created a tool to help us with this problem, please go review: My tool seems to (at least partially) validate your approach actually. I added the following line... if (len < 32) {
return scalar::utf8::validate(buf, len);
}On a macbook, I get... So, contrary to what I wrote above, in at least one case, your approach helps tremendously. This is good news because your approach is so clean and simple. |
|
We will need to test this on different systems. :-) |
|
Thank you for the benchmark. Are the results shown above for the new code or the old? Asking since I received very good speedups for small inputs on Macbook M1 Pro, also I did not get a big jump for input sizes > 32. Benchmarks on Macbook M1 Pro (arm), New - Will also test this on a x86 machine asap. |
|
Results on Intel(R) Core i9-14900HX (Haswell, x86_64) - Old - New - Performing better than the old code but not as performant as on macbook m1 pro (above results). |
|
@lemire, I've tested another approach by overloading the function inline const implementation *get_default_implementation(size_t len) {
const implementation* opt[2] = {internal::get_fallback_singleton(), internal::get_single_implementation()};
return opt[(len > 32)];
}I've modified macros at other places so that the above overload works. So, the simdutf_warn_unused bool validate_utf8(const char *buf, size_t len) noexcept {
return get_default_implementation(len)->validate_utf8(buf, len);
}Performance of this is less than the approach above but greater than the original. Few numbers for ref - Benchmarks on Macbook M1 Pro (arm) Which approach do you recommend? |
|
@sleepingeight Your current PR is likely the right approach (although I was skeptical at first).
We probably don't want to do something so complicated.
Neither. I manually modified the code from our main branch and got these results. We need to test a lot of functions on a lot of different systems. The challenge is that we need to make sure that it works well for many systems. This will require data. |
|
So I don't think we want to go so broad without careful analysis. I think we need to do a case-by-case analysis using the new benchmark tools #927 |
|
Do you have any suggestion in mind on how to take this forward? I can test things on arm and haswell processors. |
I think that we want a careful analysis that can identify a case where our performance is less than stellar. And then we want to clearly document a benefit. In other words, we want more of a scalpel. It is just as important for us to improve the performance than to avoid regressions. So we need to be careful. |

Related to #925
Started with a straightforward implementation, modifying the public API's defination in this way,
Which was before,
simdutf/src/implementation.cpp
Lines 1429 to 1431 in bcfbae9
Using the
benchmarkutiltiy, got 0 improvements. But when benchmarked using google-benchmark for input sizes from 1 to 15, these are the results forvalidate_utf8procedure -Old:
New:
Seeing a greater than 2x improvement.