Expanding on the regex rewriting we did in T317599 we should also be able to support some simple escape sequences which will allow users to more easily search for multiline content. The inclusion of \u should allow users to search for any character that is inconvenient to directly include in the search itself.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch | |
|---|---|---|---|---|---|
| Bump plugins to 1.3.20-9 | repos/search-platform/cirrussearch-opensearch-image!18 | ebernhardson | work/ebernhardson/plugins-1.3.20-9 | main | |
| Update plugins for regex syntax | repos/search-platform/opensearch-plugins-deb!7 | ebernhardson | work/ebernhardson/regex-update | master |
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | EBernhardson | T403212 Support \r, \n, \t, and \uNNNN in insource and intitle queries | |||
| Resolved | RKemper | T403749 Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters |
Event Timeline
Change #1182923 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[wmf-jvm-utils@master] regex: Support escape code expansion
(i'm guessing this isn't ready to announce yet given that the patch isn't currently merged?)
Change #1182923 merged by jenkins-bot:
[wmf-jvm-utils@master] regex: Support escape code expansion
Change #1184587 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184588 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184587 merged by jenkins-bot:
[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184588 merged by jenkins-bot:
[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6
ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7
Update plugins for regex syntax
ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7
Update plugins for regex syntax
Indeed it'll take another week or so. Once the subtask to roll-restart the search clusters is complete the functionality should be fully available.
ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18
Bump plugins to 1.3.20-9
ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18
Bump plugins to 1.3.20-9
@EBernhardson Thanks for adding these characters to the docs!
Re the \uNNNN matching, do you think it's worth clarifying the term "surrogate pairs" in those docs? I ask as it took me a few minutes to work out that — to search for the equivalent of intitle:/💔/ — I needed to use intitle:/\uD83D\uDC94/ (instead of e.g. just using intitle:/\u1F494/). Or do you think that people using this escape character would generally be expected to know what this would be referring to? (genuine question)
We discussed in the team a bit before releasing this if \uHHHH was the best format, or if we should go with something else. The basic reasoning here was:
- JSON / javascript use the \uHHHH syntax, we took this to mean that it is a widely understood syntax and shouldn't be too hard for users to find related documentation / examples.
- PCRE2 (common regex library) also makes the \uHHHH syntax available, but only when compiled with specific flags. This is documented as being done because users were so familiar with/expected the JSON style notation
- The search engine accepts multi-byte utf8 characters natively. As shown, we can directly intitle:/💔/ without using the \u escaping. The intended use case for \u escaping is more for unprintable characters, but also includes things that are otherwise inconvenient to type directly.
I don't necessarily think people will know about surrogate pairs just because they are familiar with \u syntax. Surrogate pairs are a bit of a niche technical detail that i suspect many engineers have only a passing familiarity with, with less-technical users being even less familiar. I've added a link to the enwiki section of the UTF-16 encoding article about surrogate pairs, but I'm not sure how to otherwise describe them without going into significantly more detail than is appropriate there.
Support \r, \n, \t, and \uNNNN in insource and intitle queries
Open, Needs Triage
Why is this bug still open given the fix was announced 11 days ago?
Also I wonder what is the point with supporting CR AKA \r given that wiki AFAIK uses LF-text only.
Anyway, if this fix really works then I probably will be able to delete this
VGSTE3MP = "[^0-9A-Za-z\}\{ \|\=\[\]\'\#\.\:\*]{1}"
silly code from my bot.
@EBernhardson Fair enough. Thank you for the explanation, and for adding the link!
I'm guessing not for right now, but - for the future - I wonder if it's worth supporting the use of curly-braces with the \u escape code, in a way like \u{HHHHH} (that can be used with characters like \u{1F494})? I haven't looked into it deeply, but at first glance it seems like this is something that's supported by JavaScript (using \u{1F494}) & by PCRE2 (using \x{1F494}).
(I can create a separate task for it if you think it might be worth considering?)