Re-escape wildcard operators produced by the normalizer in wildcard queries #151008

Open
Incheonkirin wants to merge 3 commits into elastic:main from Incheonkirin:fix/wildcard-normalizer-reescape-150699

@Incheonkirin

Wildcard queries on a keyword field can return wrong results when the field uses a normalizer that maps characters to or from the wildcard control characters *, ?, or \. The most common trigger is ICU NFKC, which maps the fullwidth forms ＊, ？, ＼ (U+FF0A, U+FF1F, U+FF3C) to ASCII *, ?, \.
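
As a quick illustration of that mapping, here is a minimal sketch that uses the JDK's `java.text.Normalizer` as a stand-in for ICU NFKC. That substitution is an assumption made for illustration; a real keyword normalizer is an analysis chain configured on the field. `FullwidthNfkcDemo` and `nfkc` are hypothetical names, not part of Elasticsearch.

```java
import java.text.Normalizer;

final class FullwidthNfkcDemo {
    // Stand-in for an ICU NFKC normalizer (assumption: the JDK's NFKC
    // behaves the same way for these three characters).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF0A FULLWIDTH ASTERISK, U+FF1F FULLWIDTH QUESTION MARK,
        // U+FF3C FULLWIDTH REVERSE SOLIDUS
        String fullwidth = "\uFF0A\uFF1F\uFF3C";
        // After NFKC the three characters are the ASCII wildcard control characters.
        System.out.println(nfkc(fullwidth).equals("*?\\")); // prints true
    }
}
```

Indexed values pass through the same normalizer, so a document containing a fullwidth asterisk is stored with an ASCII `*`, which is why the query side has to normalize and escape consistently.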

normalizeWildcardPattern normalizes the literal parts of the pattern and keeps ?/* as operators, but had two defects:

  • The contents of an escape \X were appended verbatim, so the escaped character was never normalized: \＊ searched for a character that isn't in the index and returned 0 hits.
  • A normalized literal wasn't re-escaped, so a normalizer emitting */?/\ turned literal data into an operator: a bare ＊ became a wildcard and over-matched.

The fix gathers each contiguous literal run (across plain text and escapes) and normalizes it as a whole, so context-sensitive normalizers (e.g. ICU NFC) see the full run, then re-escapes any */?/\ the normalizer produces. Operator runs are kept verbatim, and WILDCARD_PATTERN now uses DOTALL so an escape before a line terminator is still treated as an escape.
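
The approach described above can be sketched roughly as follows. This is not the code in the PR: `WildcardReescapeSketch`, `TOKEN`, `NFKC_LOWER`, and `normalizeWildcard` are hypothetical names, the regex is an assumed shape for WILDCARD_PATTERN, and a `UnaryOperator<String>` (JDK NFKC plus lowercasing) stands in for the field's normalizer.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WildcardReescapeSketch {
    // Group 1: an escape (backslash plus any character; DOTALL lets '.' match
    // line terminators too). Group 2: a run of ?/* operators.
    // Assumed shape, not necessarily the real WILDCARD_PATTERN.
    private static final Pattern TOKEN = Pattern.compile("(\\\\.)|([?*]+)", Pattern.DOTALL);

    // Hypothetical stand-in for a keyword normalizer: NFKC then lowercase.
    static final UnaryOperator<String> NFKC_LOWER =
        s -> Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);

    static String normalizeWildcard(String pattern, UnaryOperator<String> normalizer) {
        StringBuilder out = new StringBuilder();
        StringBuilder literal = new StringBuilder(); // current literal run, unescaped
        Matcher m = TOKEN.matcher(pattern);
        int last = 0;
        while (m.find()) {
            literal.append(pattern, last, m.start()); // plain text before the token
            if (m.group(1) != null) {
                // Escaped character joins the literal run so it gets normalized too.
                literal.append(m.group(1).substring(1));
            } else {
                flush(literal, out, normalizer);
                out.append(m.group(2)); // operator run is kept verbatim
            }
            last = m.end();
        }
        literal.append(pattern, last, pattern.length());
        flush(literal, out, normalizer);
        return out.toString();
    }

    // Normalize the whole literal run, then re-escape any * ? \ it now contains.
    private static void flush(StringBuilder literal, StringBuilder out, UnaryOperator<String> normalizer) {
        if (literal.length() == 0) {
            return;
        }
        String normalized = normalizer.apply(literal.toString());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            if (c == '*' || c == '?' || c == '\\') {
                out.append('\\');
            }
            out.append(c);
        }
        literal.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(normalizeWildcard("Foo*Bar?", NFKC_LOWER));      // foo*bar?
        System.out.println(normalizeWildcard("\\" + "\uFF0A", NFKC_LOWER)); // \*
        System.out.println(normalizeWildcard("\uFF0A", NFKC_LOWER));        // \*
        System.out.println(normalizeWildcard("abc\\", NFKC_LOWER));         // abc\\
    }
}
```

In this sketch both an escaped and a bare fullwidth asterisk come out as the escaped literal `\*`, while an ASCII `*` typed by the user stays an operator. A trailing lone backslash is not matched as an escape, so it falls into the literal run and is emitted as `\\`.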

One behavior change: a trailing lone backslash (abc\) now normalizes to an escaped literal backslash \\ instead of passing through as a raw \; Lucene treats both forms the same way.

Closes #150699

…ueries

When a keyword field uses a normalizer that maps to or from the wildcard control
characters * ? \ (most commonly ICU NFKC, which maps fullwidth forms such as
＊ ？ ＼ to ASCII), wildcard queries returned incorrect results via
StringFieldType.normalizeWildcardPattern:

  - Bug 1: the contents of an escape sequence \X were appended verbatim, so the
    escaped character was never normalized. A literal search such as \＊ looked
    for a character absent from the index (0 hits).
  - Bug 2: a normalized literal was appended without re-escaping, so a normalizer
    that emits * ? \ turned literal data into an operator. A bare ＊ became a
    wildcard and over-matched.

Accumulate contiguous literal characters across plain text and escape sequences
and normalize each run as a whole, so context-sensitive normalizers (e.g. ICU
NFC composition) see the full run. Re-escape any * ? \ the normalizer produces,
and preserve ? * operator runs verbatim. WILDCARD_PATTERN is now DOTALL so an
escape before a line terminator is still recognised as an escape.
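
The DOTALL point can be checked in isolation. The sketch below is a minimal illustration using only `java.util.regex`; `DotallEscapeDemo` and its fields are hypothetical names, and the one-token regex is a simplified assumption rather than the actual WILDCARD_PATTERN.

```java
import java.util.regex.Pattern;

final class DotallEscapeDemo {
    // A backslash followed by "any character". By default '.' excludes line terminators.
    static final Pattern WITHOUT_DOTALL = Pattern.compile("\\\\.");
    // With DOTALL, '.' also matches line terminators.
    static final Pattern WITH_DOTALL = Pattern.compile("\\\\.", Pattern.DOTALL);

    static boolean findsEscape(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\\\nb"; // a, backslash, newline, b
        System.out.println(findsEscape(WITHOUT_DOTALL, input)); // false
        System.out.println(findsEscape(WITH_DOTALL, input));    // true
    }
}
```

Without DOTALL the backslash before the newline is not seen as an escape and would be handled as plain text instead.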
@cla-checker-service

cla-checker-service Bot commented Jun 8, 2026

💚 CLA has been signed

@elasticsearchmachine added the v9.5.0, needs:triage, and external-contributor labels on Jun 8, 2026
@benchaplin added the :Search Foundations/Mapping label and removed the needs:triage label on Jun 10, 2026
@elasticsearchmachine

Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine added the Team:Search Foundations label on Jun 10, 2026
Development

Successfully merging this pull request may close these issues.

Wildcard query normalizer mishandles escapes and re-escaping, breaking literal search for any character that normalizes to or from * ? \