WIP: Various updates to the Regex HOWTO by akuchling · Pull Request #107825 · python/cpython

akuchling · 2023-08-09T23:15:18Z

As people sent me comments over the years, I've been collecting user feedback on the Regex HOWTO. This PR will contain the resulting set of changes. It is currently still work-in-progress; I have a lengthy list of changes that I'm making.

I'll try very hard to keep each commit completely and logically separated, so you may want to proofread commit-by-commit. Feel free to cherry-pick particular commits into main if you like while other commits get worked on; I can rebase or merge and try to keep things coherent.

📚 Documentation preview 📚: https://cpython-previews--107825.org.readthedocs.build/

serhiy-storchaka · 2023-08-10T21:03:14Z

Doc/howto/regex.rst

+To specify them in the pattern, you can write them as an embedded
+modifier at the start of the pattern that uses the short one-letter
+form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.


It is worth mentioning "modifier spans" like (?i:...). They are more powerful than global flags and modifiers.

I think so, but it's also okay to do that in a separate PR. We can iterate and work incrementally.
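For reference, a minimal sketch of the difference between a global embedded modifier and a modifier span, using only the standard `re` module. The sample strings and variable names are made up for illustration and are not taken from the HOWTO.

```python
import re

# Global embedded modifier: (?i) at the start applies to the whole pattern.
global_flag = re.compile(r'(?i)python rocks')

# Modifier span: (?i:...) is case-insensitive only inside the group.
scoped = re.compile(r'(?i:python) rocks')

# Several one-letter flags can be combined, here MULTILINE and IGNORECASE.
combined = re.compile(r'(?mi)^python')

print(bool(global_flag.search('PYTHON ROCKS')))  # True
print(bool(scoped.search('PYTHON rocks')))       # True
print(bool(scoped.search('PYTHON ROCKS')))       # False: ' rocks' stays case-sensitive
print(combined.search('first line\nPython second').group())  # Python
```

The span form is what makes it possible to relax case for one part of a pattern while the rest stays strict.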

serhiy-storchaka · 2023-08-10T21:05:39Z

Doc/howto/regex.rst

 For example, the following RE detects doubled words in a string. ::

-   >>> p = re.compile(r'\b(\w+)\s+\1\b')
+   >>> p = re.compile(r'\b(\w+)\b\s+\1\b')


The second \b was removed intentionally. It is not needed here.

It would also be worth using possessive qualifiers here.

But it's fine to keep the second \b, and when modifying the example for some other context it might be useful. So I'd be fine with keeping it too. (Note that it's mentioned in the text below also.)

(Also, what's a possessive qualifier?)

Not exactly this example, but see the conversation in #21420 about redundant \b.

This example was fixed in #4443. It was incorrect without \b at the end, but \b between \w and \s is redundant by definition.
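A quick sketch of the two claims above, with hypothetical test strings: the \b between \w+ and \s+ does not change the result, while the trailing \b does.

```python
import re

with_middle_b = re.compile(r'\b(\w+)\b\s+\1\b')   # variant proposed in the diff
without_middle_b = re.compile(r'\b(\w+)\s+\1\b')  # existing HOWTO example
no_trailing_b = re.compile(r'\b(\w+)\s+\1')       # trailing \b dropped

text = 'Paris in the the spring'
# A greedy \w+ followed by \s+ can only end at a word boundary,
# so both variants find the same doubled word.
print(with_middle_b.search(text).group())     # the the
print(without_middle_b.search(text).group())  # the the

# Without the trailing \b, a word that is only a prefix of the
# next word is wrongly reported as doubled.
print(no_trailing_b.search('the theory').group())  # the the
print(without_middle_b.search('the theory'))       # None
```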

Sorry, not "possessive qualifier" but "possessive quantifier" (although some documents call them "qualifiers"). A possessive quantifier is a quantifier that does not backtrack. It is written by adding + to the quantifier (just as non-greedy quantifiers are written by adding ?). For example, when trying to match the pattern with greedy quantifiers \b(\w+)\s+\1\b against "then the", a dumb backtracking engine will try to match "then then", fail, then backtrack and successively try "the ", "th ", "t " until it gives up. With possessive quantifiers, \b(\w++)\s++\1\b, it will not backtrack and will fail sooner. This is a new feature in Python 3.11. Although most modern RE engines support it, it is relatively little known because older RE engines did not.

See https://www.regular-expressions.info/possessive.html
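A small sketch of the comparison described above. It assumes Python 3.11 or later for the possessive form and falls back to `None` on older versions. The sample strings are illustrative, and the sketch only checks that both patterns agree on them; it does not measure the speed difference.

```python
import re
import sys

# Greedy quantifiers: the engine may backtrack into \w+ and \s+ on failure.
greedy = re.compile(r'\b(\w+)\s+\1\b')

# Possessive quantifiers (\w++ and \s++) never give back what they matched.
# Older Pythons reject this syntax, hence the version guard.
if sys.version_info >= (3, 11):
    possessive = re.compile(r'\b(\w++)\s++\1\b')
else:
    possessive = None

for sample in ('then the', 'the the'):
    g = greedy.search(sample)
    print('greedy    ', repr(sample), g.group() if g else None)
    if possessive is not None:
        p = possessive.search(sample)
        print('possessive', repr(sample), p.group() if p else None)
```

On these inputs both patterns report the same result; the possessive one simply has fewer alternatives to try before failing on "then the".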

OK, I've removed the second \b and edited the text below a bit.

serhiy-storchaka · 2023-08-10T21:12:50Z

It would be nice to add more about possessive qualifiers and atomic grouping. Modifier spans are also underrated.

gvanrossum

Hi Andrew! Here are some small suggestions. I recommend merging this rather than sitting on it for much longer. If there are improvements you're still planning to make but don't feel you have time for right now, feel free to open another PR. I promise to review and merge quickly -- this looks like almost everything is uncontroversial.

gvanrossum · 2023-12-25T02:34:11Z

Doc/howto/regex.rst

+To specify them in the pattern, you can write them as an embedded
+modifier at the start of the pattern that uses the short one-letter
+form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.


I think so, but it's also okay to do that in a separate PR. We can iterate and work incrementally.

gvanrossum · 2023-12-25T02:37:34Z

Doc/howto/regex.rst

 For example, the following RE detects doubled words in a string. ::

-   >>> p = re.compile(r'\b(\w+)\s+\1\b')
+   >>> p = re.compile(r'\b(\w+)\b\s+\1\b')


But it's fine to keep the second \b, and when modifying the example for some other context it might be useful. So I'd be fine with keeping it too. (Note that it's mentioned in the text below also.)

(Also, what's a possessive qualifier?)

akuchling · 2024-09-25T02:04:33Z

OK, I've applied a bunch of the suggested revisions, and also added comments listing future topics such as possessive quantifiers and spanning modifiers. Let's work on those in future PRs, since this one has already taken long enough! 🕙

akuchling added 7 commits August 9, 2023 19:13

Remove most uses of the word 'obvious'

9580098

Unrecognized escapes now raise a SyntaxWarning, not a DeprecationWarning. Remove use of undefined jargon 'cooked'.

808e281

Add paragraph break

88bbe21

Remove extra parathesis from an example

bdf44f2

Describe .fullmatch() method

ce864e2

Fix bug in doubled-word example, and try to clarify the explanation

1eaa7ac

Clarify discussion of named groups

a4038d8

bedevere-bot added the awaiting review, docs (Documentation in the Doc dir), and skip news labels Aug 9, 2023

akuchling added 2 commits August 9, 2023 19:41

Mention := operator

662e646

Describe how to use flags, and embedded modifiers such as (?x)

132b3e6

akuchling changed the title ~~Various updates to the Regex HOWTO~~ WIP: Various updates to the Regex HOWTO Aug 10, 2023

akuchling added 3 commits August 9, 2023 20:21

re.sub() now has a flags argument

05555df

Make re.sub() and re.split() signature match the current module

acd1460

Move discussion of zero-width assertions, and clarify that repeating them is an error

f879c88

serhiy-storchaka self-requested a review August 10, 2023 20:49

serhiy-storchaka reviewed Aug 10, 2023

Move fullmatch() above match(), and re-word this table a bit

d9e8ddf

picnixz reviewed Sep 9, 2023

serhiy-storchaka marked this pull request as draft December 1, 2023 09:17

bedevere-app bot removed the awaiting review label Dec 1, 2023

gvanrossum mentioned this pull request Dec 25, 2023

gh-107432: Rework the style of Doc/howto/functional.rst #107449

Open

gvanrossum reviewed Dec 25, 2023

akuchling added 5 commits September 24, 2024 19:40

Merge branch 'main' into update-regex-howto

2db6962

Fix some lint-detected markup issues

e370981

Remove \b from double-word example

e1b084c

Add comments listing future work

0e855ea

Break long line

7a97af3

akuchling and others added 8 commits September 24, 2024 21:43

Use same word in example

961a4ef

Update Doc/howto/regex.rst

b0258f6

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>

PEP8

bb9497d

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>

Typo fix

23c2934

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>

PEP8

4752488

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

PEP8

05c04ec

Merge branch 'update-regex-howto' of github.com:akuchling/cpython into update-regex-howto

39b89c3

Remove somewhat off-topic sentence

900c50b

akuchling marked this pull request as ready for review September 25, 2024 02:00

bedevere-app bot added the awaiting review label Sep 25, 2024

StanFromIreland added the skip issue label Aug 25, 2025


6 participants
