implementing the backslash-free string optimization #2596

Open
lemire wants to merge 5 commits into master from optimization-get_string

Conversation

@lemire
Member

@lemire lemire commented Jan 22, 2026

This builds on top of a PR by @CarlosEduR. #2211

Fixes #1470

This looks like the proper design to me right now, but we need to run benchmarks to confirm that it is at least performance neutral.
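For readers following along, the idea can be illustrated with a minimal sketch (hypothetical names, not simdjson's actual internals): when the raw string contains no backslash, the parser can hand out a zero-copy view into the input buffer, and only strings containing escapes need to be decoded into a scratch buffer.

```cpp
#include <string>
#include <string_view>

// Hypothetical sketch of the backslash-free string optimization.
// `raw` is the string content between the quotes, still escaped.
std::string_view get_string_fast(std::string_view raw, std::string &scratch) {
  if (raw.find('\\') == std::string_view::npos) {
    return raw;  // no escapes: return a zero-copy view into the document
  }
  // Fallback: decode escapes into scratch (simplified: only \" and \\ here;
  // real code must also handle \n, \t, \uXXXX, and reject invalid escapes).
  scratch.clear();
  for (size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] == '\\' && i + 1 < raw.size()) {
      scratch.push_back(raw[++i]);
    } else {
      scratch.push_back(raw[i]);
    }
  }
  return scratch;
}
```

The trade-off the benchmarks below probe is that every string now pays for the backslash scan, while only escape-free strings recoup it by skipping the copy.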

@camel-cdr
Contributor

The check for backslashes seems like a prime candidate for SIMD or SWAR approaches.
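For reference, the SWAR variant of such a scan is a standard word-at-a-time trick (a generic sketch, not code from this PR): XOR each 8-byte word with a mask of backslash bytes so matching bytes become zero, then apply the exact zero-byte detection formula.

```cpp
#include <cstdint>
#include <cstring>

// SWAR check: does this 8-byte chunk contain the backslash byte 0x5C?
// XOR turns matching bytes into 0x00; the (x - 0x01..) & ~x & 0x80..
// formula then flags exactly the zero bytes (no false positives).
bool has_backslash8(const char *p) {
  uint64_t v;
  std::memcpy(&v, p, 8);  // safe unaligned load
  uint64_t x = v ^ 0x5C5C5C5C5C5C5C5CULL;
  return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}
```

A full implementation would loop over the string eight bytes at a time and handle the trailing partial word separately.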

@CarlosEduR
Member

@lemire nice!
I can run the benchmarks on my local machine tomorrow and share the results.

@lemire
Member Author

lemire commented Jan 26, 2026

@camel-cdr Yes, but we don’t want to put it behind runtime dispatching.

@lemire
Member Author

lemire commented Jan 26, 2026

@CarlosEduR

I can run the benchmarks on my local machine tomorrow and share the results.

That would be great. The priority is to make sure we do not get any performance regression.

@lemire
Member Author

lemire commented Jan 27, 2026

LLVM 16, Apple M4 tests

Benchmark:

./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>"

Main branch:

find_tweet<simdjson_ondemand>/manual_time     194711 ns       206565 ns         3594 best_bytes_per_sec=3.6592G best_cycles=782.879k best_cycles_per_byte=1.23968 best_docs_per_sec=5.79431k best_frequency=4.53625G best_instructions=4.29687M best_instructions_per_byte=6.80406 best_instructions_per_cycle=5.48855 best_items_per_sec=5.79431k bytes=631.515k bytes_per_second=3.0206Gi/s cycles=788.099k cycles_per_byte=1.24795 docs_per_sec=5.13582k/s frequency=4.04753G/s instructions=4.29737M instructions_per_byte=6.80485 instructions_per_cycle=5.45282 items=1 items_per_second=5.13582k/s [BEST: throughput=  3.66 GB/s doc_throughput=  5794 docs/s instructions=     4296869 cycles=      782879 items=         1 avg_time=    194711 ns]

This PR:

find_tweet<simdjson_ondemand>/manual_time     199841 ns       208790 ns         3569 best_bytes_per_sec=3.65038G best_cycles=784.858k best_cycles_per_byte=1.24282 best_docs_per_sec=5.78035k best_frequency=4.53675G best_instructions=4.30413M best_instructions_per_byte=6.81556 best_instructions_per_cycle=5.48396 best_items_per_sec=5.78035k bytes=631.515k bytes_per_second=2.94306Gi/s cycles=789.668k cycles_per_byte=1.25043 docs_per_sec=5.00397k/s frequency=3.95148G/s instructions=4.30458M instructions_per_byte=6.81628 instructions_per_cycle=5.45113 items=1 items_per_second=5.00397k/s [BEST: throughput=  3.65 GB/s doc_throughput=  5780 docs/s instructions=     4304130 cycles=      784858 items=         1 avg_time=    199841 ns]

Conclusion: Essentially no change.

Benchmark:

./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"

Main:

partial_tweets<simdjson_ondemand>/manual_time    1184151 ns      1195886 ns          590 best_bytes_per_sec=589.604M best_cycles=4.75032M best_cycles_per_byte=7.52211 best_docs_per_sec=933.634 best_frequency=4.43507G best_instructions=20.9986M best_instructions_per_byte=33.2512 best_instructions_per_cycle=4.42046 best_items_per_sec=93.3634k bytes=631.515k bytes_per_second=508.6Mi/s cycles=4.73808M cycles_per_byte=7.50272 docs_per_sec=844.487/s frequency=4.00125G/s instructions=21.0004M instructions_per_byte=33.254 instructions_per_cycle=4.43225 items=100 items_per_second=84.4487k/s [BEST: throughput=  0.59 GB/s doc_throughput=   933 docs/s instructions=    20998634 cycles=     4750323 items=       100 avg_time=   1184151 ns]

This PR:

partial_tweets<simdjson_ondemand>/manual_time    1253094 ns      1263943 ns          561 best_bytes_per_sec=535.447M best_cycles=5.04903M best_cycles_per_byte=7.99511 best_docs_per_sec=847.877 best_frequency=4.28095G best_instructions=21.6601M best_instructions_per_byte=34.2986 best_instructions_per_cycle=4.28995 best_items_per_sec=84.7877k bytes=631.515k bytes_per_second=480.618Mi/s cycles=5.06068M cycles_per_byte=8.01355 docs_per_sec=798.025/s frequency=4.03855G/s instructions=21.663M instructions_per_byte=34.3032 instructions_per_cycle=4.28064 items=100 items_per_second=79.8025k/s [BEST: throughput=  0.54 GB/s doc_throughput=   847 docs/s instructions=    21660103 cycles=     5049029 items=       100 avg_time=   1253094 ns]

Conclusion: a regression

GCC 15, Intel Ice Lake tests

Benchmark:

./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>"

Main branch:

find_tweet<simdjson_ondemand>/manual_time      82550 ns       131659 ns         6326 best_branch_miss=111 best_bytes_per_sec=9.17833G best_cache_miss=0 best_cache_ref=4.024k best_cycles=213.344k best_cycles_per_byte=0.337829 best_docs_per_sec=14.5338k best_frequency=3.1007G best_instructions=759.93k best_instructions_per_byte=1.20334 best_instructions_per_cycle=3.56199 best_items_per_sec=14.5338k branch_miss=111.766 bytes=631.515k bytes_per_second=7.12471Gi/s cache_miss=8.06197m cache_ref=4.0704k cycles=216.766k cycles_per_byte=0.343247 docs_per_sec=12.1139k/s frequency=2.62587G/s instructions=759.93k instructions_per_byte=1.20334 instructions_per_cycle=3.50577 items=1 items_per_second=12.1139k/s [BEST: throughput=  9.18 GB/s doc_throughput= 14533 docs/s instructions=      759930 cycles=      213344 branch_miss=     111 cache_miss=       0 cache_ref=      4024 items=         1 avg_time=     82549 ns]

This PR:

find_tweet<simdjson_ondemand>/manual_time      93369 ns       150876 ns         6363 best_branch_miss=106 best_bytes_per_sec=9.16022G best_cache_miss=0 best_cache_ref=4.153k best_cycles=213.55k best_cycles_per_byte=0.338155 best_docs_per_sec=14.5052k best_frequency=3.09758G best_instructions=762.109k best_instructions_per_byte=1.20679 best_instructions_per_cycle=3.56876 best_items_per_sec=14.5052k branch_miss=111.936 bytes=631.515k bytes_per_second=6.29917Gi/s cache_miss=0.343863 cache_ref=4.03812k cycles=217.628k cycles_per_byte=0.344612 docs_per_sec=10.7102k/s frequency=2.33085G/s instructions=762.109k instructions_per_byte=1.20679 instructions_per_cycle=3.50189 items=1 items_per_second=10.7102k/s [BEST: throughput=  9.16 GB/s doc_throughput= 14505 docs/s instructions=      762109 cycles=      213550 branch_miss=     106 cache_miss=       0 cache_ref=      4153 items=         1 avg_time=     93368 ns]

Conclusion: Essentially no change.

Benchmark:

./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"

Main:

partial_tweets<simdjson_ondemand>/manual_time     150680 ns       207613 ns         3756 best_branch_miss=227 best_bytes_per_sec=5.38114G best_cache_miss=2 best_cache_ref=4.965k best_cycles=363.34k best_cycles_per_byte=0.575347 best_docs_per_sec=8.52101k best_frequency=3.09602G best_instructions=1.35124M best_instructions_per_byte=2.13968 best_instructions_per_cycle=3.71895 best_items_per_sec=852.101k branch_miss=229.731 bytes=631.515k bytes_per_second=3.90327Gi/s cache_miss=1.59239 cache_ref=4.98846k cycles=368.757k cycles_per_byte=0.583925 docs_per_sec=6.63659k/s frequency=2.44729G/s instructions=1.35124M instructions_per_byte=2.13968 instructions_per_cycle=3.66432 items=100 items_per_second=663.659k/s [BEST: throughput=  5.38 GB/s doc_throughput=  8521 docs/s instructions=     1351243 cycles=      363340 branch_miss=     227 cache_miss=       2 cache_ref=      4965 items=       100 avg_time=    150679 ns]

This PR:

partial_tweets<simdjson_ondemand>/manual_time     191066 ns       247912 ns         3144 best_branch_miss=388 best_bytes_per_sec=4.57659G best_cache_miss=0 best_cache_ref=3.04k best_cycles=427.13k best_cycles_per_byte=0.676358 best_docs_per_sec=7.24701k best_frequency=3.09541G best_instructions=1.57057M best_instructions_per_byte=2.48698 best_instructions_per_cycle=3.67702 best_items_per_sec=724.701k branch_miss=396.183 bytes=631.515k bytes_per_second=3.07822Gi/s cache_miss=9.86005m cache_ref=3.07924k cycles=439.049k cycles_per_byte=0.695231 docs_per_sec=5.23379k/s frequency=2.29789G/s instructions=1.57057M instructions_per_byte=2.48698 instructions_per_cycle=3.5772 items=100 items_per_second=523.379k/s [BEST: throughput=  4.58 GB/s doc_throughput=  7247 docs/s instructions=     1570566 cycles=      427130 branch_miss=     388 cache_miss=       0 cache_ref=      3040 items=       100 avg_time=    191066 ns]

Conclusion: a regression

@lemire
Member Author

lemire commented Feb 3, 2026

@CarlosEduR Did you have a chance to run your own tests? I'd like someone to confirm my results.

@lemire
Member Author

lemire commented Feb 3, 2026

I am getting a performance regression on the static_reflect/twitter_benchmark/benchmark_parsing_twitter benchmark on a Zen 5 system, so this PR will require more investigation and tuning.

@lemire
Member Author

lemire commented Feb 3, 2026

Possibly @camel-cdr is correct: we might need better machine-specific optimization.

@CarlosEduR
Member

@lemire it seems there was actually a slight regression, wasn't there?

Based on your comment:

Main:
[BEST: throughput= 5.38 GB/s doc_throughput= 8521 docs/s instructions= 1351243

PR:
[BEST: throughput= 4.58 GB/s doc_throughput= 7247 docs/s instructions= 1570566


I just ran the benchmarks on my machine, an Intel i5-10400F (Comet Lake, GCC 13.3):

Benchmark:
./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>"

Main branch:

find_tweet<simdjson_ondemand>/manual_time     113579 ns       142940 ns         5937 best_branch_miss=1.823k best_bytes_per_sec=5.87472G best_cache_miss=0 best_cache_ref=32.173k best_cycles=462.606k best_cycles_per_byte=0.732534 best_docs_per_sec=9.30259k best_frequency=4.30343G best_instructions=1.47823M best_instructions_per_byte=2.34076 best_instructions_per_cycle=3.19544 best_items_per_sec=9.30259k branch_miss=1.8991k bytes=631.515k bytes_per_second=5.17829Gi/s cache_miss=0.772612 cache_ref=30.7478k cycles=467.607k cycles_per_byte=0.740452 docs_per_sec=8.80445k/s frequency=4.11702G/s instructions=1.47823M instructions_per_byte=2.34076 instructions_per_cycle=3.16126 items=1 items_per_second=8.80445k/s [BEST: throughput=  5.87 GB/s doc_throughput=  9302 docs/s instructions=     1478228 cycles=      462606 branch_miss=    1823 cache_miss=       0 cache_ref=     32173 items=         1 avg_time=    113578 ns]

This PR:

find_tweet<simdjson_ondemand>/manual_time     113476 ns       143510 ns         6015 best_branch_miss=1.88k best_bytes_per_sec=5.85245G best_cache_miss=0 best_cache_ref=33.361k best_cycles=464.429k best_cycles_per_byte=0.73542 best_docs_per_sec=9.26733k best_frequency=4.30401G best_instructions=1.48032M best_instructions_per_byte=2.34407 best_instructions_per_cycle=3.18739 best_items_per_sec=9.26733k branch_miss=1.93755k bytes=631.515k bytes_per_second=5.183Gi/s cache_miss=3.85287 cache_ref=32.9719k cycles=469.394k cycles_per_byte=0.743283 docs_per_sec=8.81247k/s frequency=4.13652G/s instructions=1.48032M instructions_per_byte=2.34407 instructions_per_cycle=3.15367 items=1 items_per_second=8.81247k/s [BEST: throughput=  5.85 GB/s doc_throughput=  9267 docs/s instructions=     1480316 cycles=      464429 branch_miss=    1880 cache_miss=       0 cache_ref=     33361 items=         1 avg_time=    113475 ns]

Benchmark:
./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>"

Main branch:

partial_tweets<simdjson_ondemand>/manual_time     160063 ns       178665 ns         4149 best_bytes_per_sec=4.12733G best_docs_per_sec=6.53561k best_items_per_sec=653.561k bytes=631.515k bytes_per_second=3.67446Gi/s docs_per_sec=6.24755k/s items=100 items_per_second=624.755k/s [BEST: throughput=  4.13 GB/s doc_throughput=  6535 docs/s items=       100 avg_time=    160062 ns]

This PR:

partial_tweets<simdjson_ondemand>/manual_time     174554 ns       204422 ns         3910 best_branch_miss=3.276k best_bytes_per_sec=3.76873G best_cache_miss=0 best_cache_ref=50.686k best_cycles=720.946k best_cycles_per_byte=1.14161 best_docs_per_sec=5.96776k best_frequency=4.30243G best_instructions=2.34372M best_instructions_per_byte=3.71126 best_instructions_per_cycle=3.25089 best_items_per_sec=596.776k branch_miss=3.34886k bytes=631.515k bytes_per_second=3.36941Gi/s cache_miss=0.925064 cache_ref=57.1276k cycles=726.496k cycles_per_byte=1.1504 docs_per_sec=5.72889k/s frequency=4.16202G/s instructions=2.34372M instructions_per_byte=3.71126 instructions_per_cycle=3.22606 items=100 items_per_second=572.889k/s [BEST: throughput=  3.77 GB/s doc_throughput=  5967 docs/s instructions=     2343716 cycles=      720946 branch_miss=    3276 cache_miss=       0 cache_ref=     50686 items=       100 avg_time=    174553 ns]

@lemire
Member Author

lemire commented Feb 5, 2026

@CarlosEduR This is a useful comment. It confirms the regression on a different system.

We'll have to go with a different design.

@lemire
Member Author

lemire commented Feb 6, 2026

A reasonable thing to try would be to only apply the trick for short strings.
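That restriction could look something like the following sketch (the cutoff value is a made-up placeholder that would need tuning, not a number from this PR):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical length gate: only attempt the zero-copy path for short
// strings, where the backslash scan is cheap relative to the copy it
// saves; long strings go straight to the existing unescaping path.
constexpr std::size_t kShortStringCutoff = 32;  // assumed value, needs tuning

bool try_fast_path(std::string_view raw) {
  return raw.size() <= kShortStringCutoff &&
         raw.find('\\') == std::string_view::npos;
}
```

The rationale would be that the Twitter documents used in these benchmarks are dominated by short keys and values, so a cutoff limits the scan overhead on the rare long strings.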



Development

Successfully merging this pull request may close these issues.

Don't copy strings with escape characters

3 participants