1111
1212
1313This module provides regular expression matching operations similar to
14- those found in Perl. Both patterns and strings to be searched can be
15- Unicode strings as well as 8-bit strings. The :mod: `re ` module is
16- always available.
14+ those found in Perl. The :mod:`re` module is always available.
15+
16+ Both patterns and strings to be searched can be Unicode strings as well as
17+ 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
18+ that is, you cannot match a Unicode string with a byte pattern or
19+ vice-versa; similarly, when asking for a substitution, the replacement
20+ string must be of the same type as both the pattern and the search string.
1721
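The type rule in the added paragraph can be sketched as follows. This is a minimal illustration, assuming a current CPython release, where a mismatch surfaces as ``TypeError``; the text above does not name the exception, and the helper ``raises_type_error`` is hypothetical.

```python
import re

text = "order 66"    # Unicode (str) subject
data = b"order 66"   # 8-bit (bytes) subject

# Pattern and subject of the same type work as expected.
str_match = re.search(r"\d+", text).group()
bytes_match = re.search(rb"\d+", data).group()


def raises_type_error(pattern, subject):
    """Hypothetical helper: report whether the pattern/subject mix is rejected."""
    try:
        re.search(pattern, subject)
    except TypeError:
        return True
    return False


# Mixing the two types is rejected in either direction.
mixed_bytes_on_str = raises_type_error(rb"\d+", text)
mixed_str_on_bytes = raises_type_error(r"\d+", data)

print(str_match, bytes_match, mixed_bytes_on_str, mixed_str_on_bytes)
```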
1822Regular expressions use the backslash character (``'\' ``) to indicate
1923special forms or to allow special characters to be used without invoking
@@ -212,12 +216,12 @@ The special characters are:
212216 group; ``(?P<name>...) `` is the only exception to this rule. Following are the
213217 currently supported extensions.
214218
215- ``(?iLmsux ) ``
216- (One or more letters from the set ``'i ' ``, ``'L ' ``, ``'m ' ``, ``'s ' ``,
217- ``'u' ``, ``'x' ``.) The group matches the empty string; the letters
218- set the corresponding flags: :const: `re.I ` (ignore case ),
219- :const: `re.L ` (locale dependent ), :const: `re.M ` (multi-line ),
220- :const: `re.S ` (dot matches all ), :const: `re.U ` (Unicode dependent),
219+ ``(?aiLmsux)``
220+ (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
221+ ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
222+ letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
223+ :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
224+ :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
221225 and :const: `re.X ` (verbose), for the entire regular expression. (The
222226 flags are described in :ref: `contents-of-module-re `.) This
223227 is useful if you wish to include the flags as part of the regular
@@ -324,56 +328,62 @@ the second character. For example, ``\$`` matches the character ``'$'``.
324328 word is indicated by whitespace or a non-alphanumeric, non-underscore character.
325329 Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
326330 precise set of characters deemed to be alphanumeric depends on the values of the
327- ``UNICODE `` and ``LOCALE `` flags. Inside a character range, ``\b `` represents
331+ ``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
328332 the backspace character, for compatibility with Python's string literals.
329333
330334``\B ``
331335 Matches the empty string, but only when it is *not * at the beginning or end of a
332336 word. This is just the opposite of ``\b ``, so is also subject to the settings
333- of ``LOCALE `` and ``UNICODE `` .
337+ of ``ASCII`` and ``LOCALE``.
334338
335339``\d ``
336- When the :const: `UNICODE ` flag is not specified, matches any decimal digit; this
337- is equivalent to the set ``[0-9] ``. With :const: `UNICODE `, it will match
338- whatever is classified as a digit in the Unicode character properties database.
340+ For Unicode (str) patterns:
341+ When the :const:`ASCII` flag is specified, matches any decimal digit; this
342+ is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
343+ is classified as a digit in the Unicode character properties database
344+ (but this does include the standard ASCII digits and is thus a superset
345+ of ``[0-9]``).
346+ For 8-bit (bytes) patterns:
347+ Matches any decimal digit; this is equivalent to the set ``[0-9]``.
339348
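A short sketch of the three ``\d`` behaviours described above. The sample string is an assumption chosen for illustration: an ASCII digit followed by ARABIC-INDIC DIGIT THREE (U+0663).

```python
import re

# ASCII digit 7 followed by ARABIC-INDIC DIGIT THREE (U+0663).
sample = "7\u0663"

# str pattern, default: any character classified as a digit in Unicode.
unicode_digits = re.findall(r"\d", sample)

# str pattern with the ASCII flag: only [0-9].
ascii_digits = re.findall(r"\d", sample, re.ASCII)

# bytes pattern: only [0-9]; the UTF-8 bytes of U+0663 are not digits.
bytes_digits = re.findall(rb"\d", sample.encode("utf-8"))

print(unicode_digits, ascii_digits, bytes_digits)
```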
340349``\D ``
341- When the :const: `UNICODE ` flag is not specified, matches any non-digit
342- character; this is equivalent to the set ``[^0-9] ``. With :const: `UNICODE `, it
343- will match anything other than character marked as digits in the Unicode
344- character properties database.
350+ Matches any character which is not a decimal digit. This is the
351+ opposite of ``\d`` and is therefore similarly subject to the settings of
352+ ``ASCII`` and ``LOCALE``.
345353
346354``\s ``
347- When the :const: `LOCALE ` and :const: `UNICODE ` flags are not specified, matches
348- any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v] ``. With
349- :const: `LOCALE `, it will match this set plus whatever characters are defined as
350- space for the current locale. If :const: `UNICODE ` is set, this will match the
351- characters ``[ \t\n\r\f\v] `` plus whatever is classified as space in the Unicode
352- character properties database.
355+ For Unicode (str) patterns:
356+ When the :const:`ASCII` flag is specified, matches only ASCII whitespace
357+ characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
358+ it will match this set plus whatever is classified as space in the Unicode
359+ character properties database (including for example the non-breaking
360+ spaces mandated by typography rules in many languages).
361+ For 8-bit (bytes) patterns:
362+ Matches characters considered whitespace in the ASCII character set;
363+ this is equivalent to the set ``[ \t\n\r\f\v]``.
353364
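The non-breaking space mentioned above makes a convenient test case for ``\s``. The following is a minimal sketch; the variable names are illustrative only.

```python
import re

nbsp = "\u00a0"  # NO-BREAK SPACE

# str pattern, default: Unicode whitespace, so the no-break space matches.
matches_unicode = re.search(r"\s", nbsp) is not None

# str pattern with the ASCII flag: only [ \t\n\r\f\v], so it does not match.
matches_ascii = re.search(r"\s", nbsp, re.ASCII) is not None

# bytes pattern: ASCII whitespace only, so the byte 0xA0 does not match.
matches_bytes = re.search(rb"\s", b"\xa0") is not None

print(matches_unicode, matches_ascii, matches_bytes)
```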
354365``\S ``
355- When the :const: `LOCALE ` and :const: `UNICODE ` flags are not specified, matches
356- any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v] ``
357- With :const: `LOCALE `, it will match any character not in this set, and not
358- defined as space in the current locale. If :const: `UNICODE ` is set, this will
359- match anything other than ``[ \t\n\r\f\v] `` and characters marked as space in
360- the Unicode character properties database.
366+ Matches any character which is not a whitespace character. This is the
367+ opposite of ``\s`` and is therefore similarly subject to the settings of
368+ ``ASCII`` and ``LOCALE``.
361369
362370``\w ``
363- When the :const: `LOCALE ` and :const: `UNICODE ` flags are not specified, matches
364- any alphanumeric character and the underscore; this is equivalent to the set
365- ``[a-zA-Z0-9_] ``. With :const: `LOCALE `, it will match the set ``[0-9_] `` plus
366- whatever characters are defined as alphanumeric for the current locale. If
367- :const: `UNICODE ` is set, this will match the characters ``[0-9_] `` plus whatever
368- is classified as alphanumeric in the Unicode character properties database.
371+ For Unicode (str) patterns:
372+ When the :const:`ASCII` flag is specified, this is equivalent to the set
373+ ``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
374+ alphanumeric in the Unicode character properties database (it will
375+ include most characters that can be part of a word in whatever language,
376+ as well as numbers and the underscore sign).
377+ For 8-bit (bytes) patterns:
378+ Matches characters considered alphanumeric in the ASCII character set;
379+ this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
380+ it will additionally match whatever characters are defined as
381+ alphanumeric for the current locale.
369382
370383``\W ``
371- When the :const: `LOCALE ` and :const: `UNICODE ` flags are not specified, matches
372- any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_] ``.
373- With :const: `LOCALE `, it will match any character not in the set ``[0-9_] ``, and
374- not defined as alphanumeric for the current locale. If :const: `UNICODE ` is set,
375- this will match anything other than ``[0-9_] `` and characters marked as
376- alphanumeric in the Unicode character properties database.
384+ Matches any character which is not an alphanumeric character. This is the
385+ opposite of ``\w`` and is therefore similarly subject to the settings of
386+ ``ASCII`` and ``LOCALE``.
377387
378388``\Z ``
379389 Matches only at the end of the string.
@@ -454,6 +464,25 @@ form.
454464 expression at a time needn't worry about compiling regular expressions.)
455465
456466
467+ .. data:: A
468+ ASCII
469+
470+ Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` perform
471+ ASCII-only matching instead of full Unicode matching. This is only meaningful
472+ for Unicode patterns, and is ignored for byte patterns.
473+
474+ Note that the :const:`re.U` flag still exists (as well as its synonym
475+ :const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it is
476+ redundant in Python 3.0.
477+ In previous Python versions, it was used to specify that
478+ matching had to be Unicode dependent (the default was ASCII matching in
479+ all circumstances). Starting from Python 3.0, the default is Unicode
480+ matching for Unicode strings (which can be changed by specifying the
481+ ``'a'`` flag), and ASCII matching for 8-bit strings. Further, Unicode
482+ dependent matching for 8-bit strings is no longer allowed and results
483+ in a :exc:`ValueError`.
485+
457486.. data :: I
458487 IGNORECASE
459488
@@ -465,7 +494,10 @@ form.
465494 LOCALE
466495
467496 Make ``\w ``, ``\W ``, ``\b ``, ``\B ``, ``\s `` and ``\S `` dependent on the
468- current locale.
497+ current locale. The use of this flag is discouraged as the locale mechanism
498+ is very unreliable, and it only handles one "culture" at a time anyway;
499+ you should use Unicode matching instead, which is the default in Python 3.0
500+ for Unicode (str) patterns.
469501
470502
471503.. data :: M
@@ -486,13 +518,6 @@ form.
486518 newline; without this flag, ``'.' `` will match anything *except * a newline.
487519
488520
489- .. data :: U
490- UNICODE
491-
492- Make ``\w ``, ``\W ``, ``\b ``, ``\B ``, ``\d ``, ``\D ``, ``\s `` and ``\S `` dependent
493- on the Unicode character properties database.
494-
495-
496521.. data :: X
497522 VERBOSE
498523
@@ -511,6 +536,8 @@ form.
511536 b = re.compile(r"\d+\.\d*")
512537
513538
539+
540+
514541.. function :: search(pattern, string[, flags])
515542
516543 Scan through *string * looking for a location where the regular expression