WIP: Various updates to the Regex HOWTO #107825
akuchling wants to merge 26 commits into python:main
…ing. Remove use of undefined jargon 'cooked'.
Doc/howto/regex.rst
Outdated
To specify them in the pattern, you can write them as an embedded
modifier at the start of the pattern that uses the short one-letter
form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.
It is worth mentioning "modifier spans" like (?i:...). They are more powerful than global flags and modifiers.
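For readers following along, here is a minimal sketch of the difference between a global inline flag and a modifier span. The sample strings and variable names are made up for illustration and are not taken from the HOWTO.

```python
import re

# Global inline flag: (?i) at the start applies to the whole pattern.
global_flag = re.compile(r'(?i)python rocks')

# Modifier span: (?i:...) applies the flag only inside the group.
scoped_flag = re.compile(r'(?i:python) rocks')

print(bool(global_flag.fullmatch('PYTHON ROCKS')))  # True
print(bool(scoped_flag.fullmatch('PYTHON rocks')))  # True
print(bool(scoped_flag.fullmatch('PYTHON ROCKS')))  # False: 'rocks' stays case-sensitive
```

The span form lets one part of a pattern ignore case while the rest stays strict, which a global flag cannot express.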
I think so, but it's also okay to do that in a separate PR. We can iterate and work incrementally.
Doc/howto/regex.rst
Outdated
For example, the following RE detects doubled words in a string. ::

- >>> p = re.compile(r'\b(\w+)\s+\1\b')
+ >>> p = re.compile(r'\b(\w+)\b\s+\1\b')
The second \b was removed intentionally. It is not needed here.
It is also worth using possessive qualifiers here.
But it's fine to keep the second \b, and when modifying the example for some other context it might be useful. So I'd be fine with keeping it too. (Note that it's mentioned in the text below also.)
(Also, what's a possessive qualifier?)
Not exactly this example, but see the conversation in #21420 about redundant \b.
This example was fixed in #4443. It was incorrect without \b at the end, but \b between \w and \s is redundant by definition.
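A quick way to check both claims is to compare the three variants side by side. This is only a sketch: the sample strings and the `found` helper are hypothetical and are not part of the HOWTO.

```python
import re

# Three variants of the doubled-word pattern discussed above.
with_inner_b = re.compile(r'\b(\w+)\b\s+\1\b')   # extra \b between \w+ and \s+
without_inner_b = re.compile(r'\b(\w+)\s+\1\b')  # inner \b dropped
no_trailing_b = re.compile(r'\b(\w+)\s+\1')      # trailing \b dropped too

def found(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = ['Paris in the the spring', 'then the', 'the then', 'no doubles here']

for text in samples:
    # The inner \b never changes the outcome on these samples.
    assert found(with_inner_b, text) == found(without_inner_b, text)

print(found(without_inner_b, 'Paris in the the spring'))  # the the
print(found(without_inner_b, 'the then'))                 # None
print(found(no_trailing_b, 'the then'))                   # the the (false positive)
```

The inner \b sits between a word character and whitespace, so it always succeeds. Without the trailing \b, the backreference matches the prefix of a longer word.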
Sorry, not "possessive qualifier" but "possessive quantifier" (although some documents call them "qualifiers"). A possessive quantifier is a quantifier that does not backtrack. It is written by adding + to the quantifier (just as non-greedy quantifiers are written by adding ?). For example, when trying to match the pattern with greedy quantifiers \b(\w+)\s+\1\b against "then the", a dumb backtracking engine will try to match "then then", fail, then backtrack and try "the ", "th ", and "t " in turn until it gives up. With the possessive quantifiers in \b(\w++)\s++\1\b it will not backtrack, so it fails sooner. This is a new feature in Python 3.11. Even though most modern RE engines support it, it is relatively little known, because old RE engines did not initially support it.
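As a rough illustration, the two forms can be compared like this. It is a sketch that assumes nothing beyond the standard `re` module; the version guard is there because the `++` syntax is rejected before Python 3.11, and the variable names are made up for this example.

```python
import re
import sys

# Greedy form: the engine may backtrack into \w+ and \s+ after a failure.
greedy = re.compile(r'\b(\w+)\s+\1\b')

# Possessive form: \w++ and \s++ never give back what they matched.
# Requires Python 3.11+; older versions reject the ++ syntax.
possessive = (
    re.compile(r'\b(\w++)\s++\1\b') if sys.version_info >= (3, 11) else None
)

for text in ('the the', 'then the'):
    greedy_hit = greedy.search(text) is not None
    if possessive is None:
        print(text, greedy_hit, '(possessive form skipped)')
    else:
        possessive_hit = possessive.search(text) is not None
        print(text, greedy_hit, possessive_hit)
```

On Python 3.11 or later both forms agree on these inputs: a match for "the the" and none for "then the". The possessive form only changes how much work the engine does before giving up.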
OK, I've removed the second \b and edited the text below a bit.
It would be nice to add more about possessive quantifiers and atomic grouping. Modifier spans are also underrated.
gvanrossum left a comment
Hi Andrew! Here are some small suggestions. I recommend merging this rather than sitting on it for much longer. If there are improvements you're still planning to make but don't feel you have time for right now, feel free to open another PR. I promise to review and merge quickly -- this looks like almost everything is uncontroversial.
Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
…o update-regex-howto
OK, I've applied a bunch of suggested revisions, and also added comments listing future topics such as the possessive quantifiers and spanning modifiers. Let's work on those in future PRs, since this one has already taken long enough! 🕙
As people sent me comments over the years, I've been collecting user feedback on the Regex HOWTO. This PR will contain the resulting set of changes. It is currently still work-in-progress; I have a lengthy list of changes that I'm making.
I'll try very hard to keep each commit completely and logically separated, so you may want to proofread commit-by-commit. Feel free to cherry-pick particular commits into main if you like while other commits get worked on; I can rebase or merge and try to keep things coherent.
📚 Documentation preview 📚: https://cpython-previews--107825.org.readthedocs.build/