Support \r, \n, \t, and \uNNNN in insource and intitle queries
Closed, Resolved · Public

Authored By

	EBernhardson
	Aug 28 2025, 8:26 PM

Description

Expanding on the regex rewriting we did in T317599, we should also be able to support some simple escape sequences that let users search for multiline content more easily. Including \u should let users search for any character that is inconvenient to type directly into the search itself.
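
As a rough illustration of what expanding these escape sequences could look like, here is a minimal Python sketch. It is not the code from the wmf-jvm-utils patch: the name `expand_escapes` and the choice to pass unknown escapes through unchanged are assumptions made for this example. A real rewriter would also have to handle cases this sketch ignores, such as an expanded character that is itself a regex metacharacter.

```python
import re

# Hypothetical helper, not the wmf-jvm-utils implementation: expand \r, \n, \t
# and \uNNNN into literal characters and leave every other escape untouched.
_SIMPLE = {"r": "\r", "n": "\n", "t": "\t"}

# A backslash followed by either "u" plus four hex digits, or any one character.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|(.))", re.DOTALL)


def expand_escapes(pattern: str) -> str:
    def replace(match):
        hex_digits, other = match.group(1), match.group(2)
        if hex_digits is not None:
            # \uNNNN names a single UTF-16 code unit.
            return chr(int(hex_digits, 16))
        # Escapes we do not know (for example \\ or \d) pass through unchanged.
        return _SIMPLE.get(other, match.group(0))

    return _ESCAPE.sub(replace, pattern)
```

Under these assumptions, `expand_escapes(r"foo\nbar")` yields `foo`, a literal line feed, then `bar`, while `\\n` (an escaped backslash followed by `n`) is left alone.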

Details

Related Changes in Gerrit:

Subject	Repo	Branch	Lines +/-
Bump lucene-regex-rewriter version to 1.0.6	search/extra	master	+1 -1
Bump lucene-regex-rewriter version to 1.0.6	search/highlighter	master	+1 -1
regex: Support escape code expansion	wmf-jvm-utils	master	+252 -4

Related Changes in GitLab:

	Title	Reference	Author	Source Branch	Dest Branch
	Bump plugins to 1.3.20-9	repos/search-platform/cirrussearch-opensearch-image!18	ebernhardson	work/ebernhardson/plugins-1.3.20-9	main
	Update plugins for regex syntax	repos/search-platform/opensearch-plugins-deb!7	ebernhardson	work/ebernhardson/regex-update	master

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		EBernhardson	T403212 Support \r, \n, \t, and \uNNNN in insource and intitle queries
		Resolved		RKemper	T403749 Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters

Event Timeline

EBernhardson created this task. · Aug 28 2025, 8:26 PM

Restricted Application added a subscriber: Aklapper. · Aug 28 2025, 8:26 PM

EBernhardson edited projects, added Discovery-Search (2025.08.15 - 2025.09.05); removed Discovery-Search. · Aug 28 2025, 8:26 PM

Change #1182923 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wmf-jvm-utils@master] regex: Support escape code expansion

https://gerrit.wikimedia.org/r/1182923

gerritbot added a project: Patch-For-Review. · Aug 28 2025, 8:29 PM

Nemoralis added a project: User-notice. · Aug 28 2025, 9:02 PM

Izno mentioned this in T317599: Allow ^ and $ in intitle regex search. · Aug 28 2025, 9:26 PM

Izno subscribed.

A_smart_kitten added a project: CirrusSearch. · Aug 29 2025, 10:19 AM

A_smart_kitten subscribed.

UOzurumba moved this task from To Triage to Announce in next Tech/News on the User-notice board. · Aug 29 2025, 2:15 PM

EBernhardson moved this task from Incoming to Needs Review on the Discovery-Search (2025.08.15 - 2025.09.05) board. · Aug 29 2025, 3:32 PM

(i'm guessing this isn't ready to announce yet given that the patch isn't currently merged?)

Change #1182923 merged by jenkins-bot:

[wmf-jvm-utils@master] regex: Support escape code expansion

https://gerrit.wikimedia.org/r/1182923

EBernhardson mentioned this in rWJVMf244ef65a2fa: regex: Support escape code expansion. · Sep 3 2025, 8:10 PM

Maintenance_bot removed a project: Patch-For-Review. · Sep 3 2025, 8:30 PM

Change #1184587 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6

https://gerrit.wikimedia.org/r/1184587

Change #1184588 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6

https://gerrit.wikimedia.org/r/1184588

Change #1184587 merged by jenkins-bot:

[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6

https://gerrit.wikimedia.org/r/1184587

Change #1184588 merged by jenkins-bot:

[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6

https://gerrit.wikimedia.org/r/1184588

Maintenance_bot removed a project: Patch-For-Review. · Sep 3 2025, 9:30 PM

ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7

Update plugins for regex syntax

ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7

Update plugins for regex syntax

Maintenance_bot removed a project: Patch-For-Review. · Sep 4 2025, 3:31 PM

In T403212#11133330, @A_smart_kitten wrote:

(i'm guessing this isn't ready to announce yet given that the patch isn't currently merged?)

Indeed, it'll take another week or so. Once the subtask to roll-restart the search clusters is complete, the functionality should be fully available.

EBernhardson updated the task description. · Sep 4 2025, 4:23 PM

ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18

Bump plugins to 1.3.20-9

EBernhardson moved this task from Needs Review to To be Deployed on the Discovery-Search (2025.08.15 - 2025.09.05) board. · Sep 4 2025, 6:42 PM

bking changed the status of subtask T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters from Open to In Progress. · Sep 4 2025, 9:55 PM

Gehel edited projects, added Discovery-Search (2025.09.05 - 2025.09.26); removed Discovery-Search (2025.08.15 - 2025.09.05). · Sep 5 2025, 8:22 AM

Gehel moved this task from Incoming to To be Deployed on the Discovery-Search (2025.09.05 - 2025.09.26) board.

bking closed subtask T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters as Resolved. · Sep 5 2025, 1:58 PM

Jack_who_built_the_house merged a task: T135280: insource: queries can't search for the return character. · Sep 7 2025, 3:37 PM

Jack_who_built_the_house added subscribers: Jack_who_built_the_house, doctaxon, Nirmos and 3 others.

EBernhardson moved this task from To be Deployed to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board. · Sep 8 2025, 3:18 PM

ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18

Bump plugins to 1.3.20-9

Maintenance_bot removed a project: Patch-For-Review. · Sep 8 2025, 5:31 PM

pfischer moved this task from Done to Reported on the Discovery-Search (2025.09.05 - 2025.09.26) board. · Sep 12 2025, 9:31 AM

@EBernhardson Thanks for adding these characters to the docs!
Re the \uNNNN matching, do you think it's worth clarifying the term "surrogate pairs" in those docs? I ask as it took me a few minutes to work out that — to search for the equivalent of intitle:/💔/ — I needed to use intitle:/\uD83D\uDC94/ (instead of e.g. just using intitle:/\u1F494/). Or do you think that people using this escape character would generally be expected to know what this would be referring to? (genuine question)

In T403212#11175588, @A_smart_kitten wrote:

@EBernhardson Thanks for adding these characters to the docs!
Re the \uNNNN matching, do you think it's worth clarifying the term "surrogate pairs" in those docs? I ask as it took me a few minutes to work out that — to search for the equivalent of intitle:/💔/ — I needed to use intitle:/\uD83D\uDC94/ (instead of e.g. just using intitle:/\u1F494/). Or do you think that people using this escape character would generally be expected to know what this would be referring to? (genuine question)

Before releasing this, we discussed in the team a bit whether \uHHHH was the best format, or whether we should go with something else. The basic reasoning here was:

JSON / JavaScript use the \uHHHH syntax. We took this to mean that it is a widely understood syntax and that it shouldn't be too hard for users to find related documentation / examples.
PCRE2 (a common regex library) also makes the \uHHHH syntax available, but only when compiled with specific flags. This is documented as being done because users were so familiar with, and expected, the JSON-style notation.
The search engine accepts multi-byte UTF-8 characters natively. As shown, we can directly search for intitle:/💔/ without using the \u escaping. The intended use case for \u escaping is mostly unprintable characters, but it also covers things that are otherwise inconvenient to type directly.

I don't necessarily think people will know about surrogate pairs just because they are familiar with the \u syntax. Surrogate pairs are a bit of a niche technical detail that I suspect many engineers have only a passing familiarity with, with less-technical users being even less familiar. I've added a link to the section of the enwiki UTF-16 article about surrogate pairs, but I'm not sure how to otherwise describe them without going into significantly more detail than is appropriate there.
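
For readers who have not met surrogate pairs before, the arithmetic behind turning 💔 (U+1F494) into \uD83D\uDC94 can be sketched in a few lines of Python. The function name `u_escape` is made up for this illustration and is not part of CirrusSearch or its documentation. It only shows the standard UTF-16 rule: code points above U+FFFF are split into a high and a low surrogate.

```python
def u_escape(char: str) -> str:
    # Hypothetical helper: spell one character as \uHHHH UTF-16 code units.
    code_point = ord(char)
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane fit in one code unit.
        return "\\u{:04X}".format(code_point)
    # Anything above U+FFFF is split into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


print(u_escape("\U0001F494"))  # the 💔 character from the comment above
```

For U+1F494 the offset is 0xF494, so the high surrogate is 0xD800 + 0x3D = 0xD83D and the low surrogate is 0xDC00 + 0x94 = 0xDC94, which matches the intitle:/\uD83D\uDC94/ query quoted above.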

Quiddity moved this task from Not ready to announce to In current Tech/News draft on the User-notice board. · Sep 12 2025, 10:25 PM

IKhitron awarded a token. · Sep 13 2025, 4:45 AM

Erutuon mentioned this in T404632: Searches for surrogate code units display search results with deleted characters. · Sep 15 2025, 6:17 PM

Alien333 awarded a token. · Sep 15 2025, 8:09 PM

Alien333 subscribed.

Nux awarded a token. · Sep 15 2025, 9:51 PM

Support \r, \n, \t, and \uNNNN in insource and intitle queries
Open, Needs Triage

Why is this bug still open given the fix was announced 11 days ago?

Also, I wonder what the point of supporting CR (AKA \r) is, given that the wiki, AFAIK, uses LF-only text.

Anyway, if this fix really works, I will probably be able to delete this silly code from my bot:

VGSTE3MP = "[^0-9A-Za-z\}\{ \|\=\[\]\'\#\.\:\*]{1}"

@EBernhardson Fair enough. Thank you for the explanation, and for adding the link!

I'm guessing not for right now, but - for the future - I wonder if it's worth supporting the use of curly-braces with the \u escape code, in a way like \u{HHHHH} (that can be used with characters like \u{1F494})? I haven't looked into it deeply, but at first glance it seems like this is something that's supported by JavaScript (using \u{1F494}) & by PCRE2 (using \x{1F494}).
(I can create a separate task for it if you think it might be worth considering?)
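
To make the suggestion concrete, here is a hedged Python sketch of how a braced form could be rewritten into the \uHHHH form that is supported today. Nothing like this exists in the search plugins as far as this thread shows: the `\u{...}` syntax, the name `expand_braced`, and the decision to leave out-of-range values untouched are all assumptions for illustration. The sketch also does not check whether the backslash is itself escaped.

```python
import re

# Hypothetical syntax: \u{H} through \u{HHHHHH}, as in JavaScript.
_BRACED = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def expand_braced(pattern: str) -> str:
    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Not a valid Unicode code point; leave the text as written.
            return match.group(0)
        if code_point < 0x10000:
            return "\\u{:04X}".format(code_point)
        # Split into a UTF-16 surrogate pair.
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return "\\u{:04X}\\u{:04X}".format(high, low)

    return _BRACED.sub(replace, pattern)
```

With this sketch, `expand_braced(r"intitle:/\u{1F494}/")` returns `intitle:/\uD83D\uDC94/`, the form that works with the current \uHHHH support.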

UOzurumba moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board. · Sep 16 2025, 10:40 PM

Gehel closed this task as Resolved. · Oct 3 2025, 8:10 AM

Maintenance_bot edited projects, added User-notice-archive; removed User-notice. · Mon, Oct 13, 9:31 AM
