Re-escape wildcard operators produced by the normalizer in wildcard queries #151008
Open
Incheonkirin wants to merge 3 commits into
When a keyword field uses a normalizer that maps to or from the wildcard control
characters * ? \ (most commonly ICU NFKC, which maps fullwidth forms such as
＊ ？ ＼ to ASCII), wildcard queries returned incorrect results via
StringFieldType.normalizeWildcardPattern:
- Bug 1: the contents of an escape sequence \X were appended verbatim, so the
escaped character was never normalized. A literal search such as \＊ looked
for a character absent from the index (0 hits).
- Bug 2: a normalized literal was appended without re-escaping, so a normalizer
that emits * ? \ turned literal data into an operator. A bare ＊ became a
wildcard and over-matched.
Accumulate contiguous literal characters across plain text and escape sequences
and normalize each run as a whole, so context-sensitive normalizers (e.g. ICU
NFC composition) see the full run. Re-escape any * ? \ the normalizer produces,
and preserve ? * operator runs verbatim. WILDCARD_PATTERN is now DOTALL so an
escape before a line terminator is still recognised as an escape.
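
For illustration, the run-based approach described above can be sketched as follows. This is a minimal standalone sketch, not the code in this PR: the class name, the `flush` helper, and the use of the JDK's `java.text.Normalizer` in place of an ICU normalizer are all assumptions made so the example runs on its own.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

final class WildcardNormalizeSketch {

    // Hypothetical stand-ins for a field normalizer (JDK forms, not ICU).
    static final UnaryOperator<String> NFKC = s -> Normalizer.normalize(s, Normalizer.Form.NFKC);
    static final UnaryOperator<String> NFC = s -> Normalizer.normalize(s, Normalizer.Form.NFC);

    static String normalizeWildcardPattern(String pattern, UnaryOperator<String> normalizer) {
        StringBuilder out = new StringBuilder();
        StringBuilder literal = new StringBuilder(); // current contiguous literal run
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                // Escape: the next char joins the literal run.
                // A trailing lone backslash is treated as a literal backslash.
                literal.append(i + 1 < pattern.length() ? pattern.charAt(++i) : '\\');
            } else if (c == '*' || c == '?') {
                // Operator: close the literal run, keep the operator verbatim.
                flush(literal, out, normalizer);
                out.append(c);
            } else {
                literal.append(c);
            }
        }
        flush(literal, out, normalizer);
        return out.toString();
    }

    // Normalizes the whole run at once, then re-escapes any control
    // character the normalizer produced so it stays literal.
    private static void flush(StringBuilder literal, StringBuilder out, UnaryOperator<String> normalizer) {
        if (literal.length() == 0) {
            return;
        }
        String normalized = normalizer.apply(literal.toString());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            if (c == '*' || c == '?' || c == '\\') {
                out.append('\\');
            }
            out.append(c);
        }
        literal.setLength(0);
    }
}
```

Under these assumptions, both an escaped and a bare fullwidth asterisk come out as the escaped ASCII `\*`, while an unescaped ASCII `*` or `?` passes through as an operator. Normalizing per run rather than per character is what lets a composing normalizer combine a base letter with a following escaped combining mark.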
💚 CLA has been signed
Pinging @elastic/es-search-foundations (Team:Search Foundations)
`wildcard` queries on a `keyword` field can return wrong results when the field uses a normalizer that rewrites the wildcard control characters `*`, `?`, or `\`. The most common trigger is ICU NFKC, which maps fullwidth forms to ASCII (`＊`→`*`, `？`→`?`, `＼`→`\`).

`normalizeWildcardPattern` normalizes the literal parts of the pattern and keeps `?`/`*` as operators, but had two defects:

- The contents of an escape sequence `\X` were appended verbatim, so the escaped character was never normalized: `\＊` searched for a character that isn't in the index and returned 0 hits.
- A normalized literal was appended without re-escaping, so a normalizer that emits `*`/`?`/`\` turned literal data into an operator: a bare `＊` became a wildcard and over-matched.

The fix gathers each contiguous literal run (across plain text and escapes) and normalizes it as a whole, so context-sensitive normalizers (e.g. ICU NFC) see the full run, then re-escapes any `*`/`?`/`\` the normalizer produces. Operator runs are kept verbatim, and `WILDCARD_PATTERN` now uses `DOTALL` so an escape before a line terminator is still treated as an escape.

One behavior change: a trailing lone backslash (`abc\`) now normalizes to a literal backslash `\\` instead of passing through as a raw `\`; Lucene treats both the same way.

Closes #150699
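
To illustrate the `DOTALL` point: in `java.util.regex`, `.` does not match line terminators by default, so an escape-matching alternative such as `\\.` skips a backslash followed by a newline. The sketch below assumes a pattern with an escape group and an operator-run group; the regex, class name, and helper are hypothetical and may not match the actual `WILDCARD_PATTERN`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotallEscapeSketch {

    // Assumed shape: group 1 = escape sequence (\X), group 2 = run of ? and * operators.
    static final String REGEX = "(\\\\.)|([?*]+)";

    // True if the pattern begins with something recognised as an escape sequence.
    static boolean startsWithEscape(String pattern, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(pattern);
        return m.lookingAt() && m.group(1) != null;
    }
}
```

With flags `0`, `startsWithEscape("\\\n", 0)` is false because `.` refuses the newline; with `Pattern.DOTALL` the same input is recognised as an escape.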