Commit fd03645

python#2834: Change re module semantics, so that str and bytes mixing is forbidden,
and str (unicode) patterns get full unicode matching by default. The re.ASCII flag is also introduced to ask for ASCII matching instead.
1 parent 3ad7ba1 commit fd03645

37 files changed

+280
-163
lines changed

Doc/library/re.rst

Lines changed: 77 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -11,9 +11,13 @@
1111

1212

1313
This module provides regular expression matching operations similar to
14-
those found in Perl. Both patterns and strings to be searched can be
15-
Unicode strings as well as 8-bit strings. The :mod:`re` module is
16-
always available.
14+
those found in Perl. The :mod:`re` module is always available.
15+
16+
Both patterns and strings to be searched can be Unicode strings as well as
17+
8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
18+
that is, you cannot match a Unicode string with a byte pattern or
19+
vice-versa; similarly, when asking for a substitution, the replacement
20+
string must be of the same type as both the pattern and the search string.
1721

1822
Regular expressions use the backslash character (``'\'``) to indicate
1923
special forms or to allow special characters to be used without invoking
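The no-mixing rule introduced in this hunk can be illustrated with a minimal sketch. `try_search` is a hypothetical helper name, not part of the `re` module, and the sample inputs are made up.

```python
import re

def try_search(pattern, subject):
    """Return the matched text, or the name of the exception raised."""
    try:
        match = re.search(pattern, subject)
    except TypeError as exc:
        return type(exc).__name__
    return match.group() if match else None

# The same type on both sides works for str and for bytes.
str_result = try_search(r"\d+", "abc 123")
bytes_result = try_search(rb"\d+", b"abc 123")

# Mixing a bytes pattern with a str subject (or the reverse) is rejected.
mixed_a = try_search(rb"\d+", "abc 123")
mixed_b = try_search(r"\d+", b"abc 123")
```

When the data arrives as bytes but the pattern is written as str, one option is to decode the data (or encode the pattern) explicitly before matching.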
@@ -212,12 +216,12 @@ The special characters are:
212216
group; ``(?P<name>...)`` is the only exception to this rule. Following are the
213217
currently supported extensions.
214218

215-
``(?iLmsux)``
216-
(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
217-
``'u'``, ``'x'``.) The group matches the empty string; the letters
218-
set the corresponding flags: :const:`re.I` (ignore case),
219-
:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
220-
:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
219+
``(?aiLmsux)``
220+
(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
221+
``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
222+
letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
223+
:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
224+
:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
221225
and :const:`re.X` (verbose), for the entire regular expression. (The
222226
flags are described in :ref:`contents-of-module-re`.) This
223227
is useful if you wish to include the flags as part of the regular
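As a small illustration of the new `'a'` inline flag, the sketch below compares the default Unicode behaviour with `(?a)` on a made-up sample string; the variable names are only for illustration.

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve"

# Default for str patterns: full Unicode matching.
unicode_words = re.findall(r"\w+", text)

# Inline (?a) switches the whole pattern to ASCII-only matching.
ascii_words = re.findall(r"(?a)\w+", text)

# Passing re.ASCII as a flag has the same effect as the inline form.
ascii_words_flag = re.findall(r"\w+", text, re.ASCII)
```

With ASCII-only matching the accented letters no longer count as word characters, so each word is split around them.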
@@ -324,56 +328,62 @@ the second character. For example, ``\$`` matches the character ``'$'``.
324328
word is indicated by whitespace or a non-alphanumeric, non-underscore character.
325329
Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
326330
precise set of characters deemed to be alphanumeric depends on the values of the
327-
``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
331+
``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
328332
the backspace character, for compatibility with Python's string literals.
329333

330334
``\B``
331335
Matches the empty string, but only when it is *not* at the beginning or end of a
332336
word. This is just the opposite of ``\b``, so is also subject to the settings
333-
of ``LOCALE`` and ``UNICODE``.
337+
of ``ASCII`` and ``LOCALE``.
334338

335339
``\d``
336-
When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
337-
is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
338-
whatever is classified as a digit in the Unicode character properties database.
340+
For Unicode (str) patterns:
341+
When the :const:`ASCII` flag is specified, matches any decimal digit; this
342+
is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
343+
is classified as a digit in the Unicode character properties database
344+
(but this does include the standard ASCII digits and is thus a superset
345+
of ``[0-9]``).
346+
For 8-bit (bytes) patterns:
347+
Matches any decimal digit; this is equivalent to the set ``[0-9]``.
339348

340349
``\D``
341-
When the :const:`UNICODE` flag is not specified, matches any non-digit
342-
character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
343-
will match anything other than character marked as digits in the Unicode
344-
character properties database.
350+
Matches any character which is not a decimal digit. This is the
351+
opposite of ``\d`` and is therefore similarly subject to the settings of
352+
``ASCII`` and ``LOCALE``.
345353

346354
``\s``
347-
When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
348-
any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
349-
:const:`LOCALE`, it will match this set plus whatever characters are defined as
350-
space for the current locale. If :const:`UNICODE` is set, this will match the
351-
characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
352-
character properties database.
355+
For Unicode (str) patterns:
356+
When the :const:`ASCII` flag is specified, matches only ASCII whitespace
357+
characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
358+
it will match this set plus whatever is classified as space in the Unicode
359+
character properties database (including for example the non-breaking
360+
spaces mandated by typography rules in many languages).
361+
For 8-bit (bytes) patterns:
362+
Matches characters considered whitespace in the ASCII character set;
363+
this is equivalent to the set ``[ \t\n\r\f\v]``.
353364

354365
``\S``
355-
When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
356-
any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``
357-
With :const:`LOCALE`, it will match any character not in this set, and not
358-
defined as space in the current locale. If :const:`UNICODE` is set, this will
359-
match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
360-
the Unicode character properties database.
366+
Matches any character which is not a whitespace character. This is the
367+
opposite of ``\s`` and is therefore similarly subject to the settings of
368+
``ASCII`` and ``LOCALE``.
361369

362370
``\w``
363-
When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
364-
any alphanumeric character and the underscore; this is equivalent to the set
365-
``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
366-
whatever characters are defined as alphanumeric for the current locale. If
367-
:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
368-
is classified as alphanumeric in the Unicode character properties database.
371+
For Unicode (str) patterns:
372+
When the :const:`ASCII` flag is specified, this is equivalent to the set
373+
``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
374+
alphanumeric in the Unicode character properties database (it will
375+
include most characters that can be part of a word in whatever language,
376+
as well as numbers and the underscore sign).
377+
For 8-bit (bytes) patterns:
378+
Matches characters considered alphanumeric in the ASCII character set;
379+
this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
380+
it will additionally match whatever characters are defined as
381+
alphanumeric for the current locale.
369382

370383
``\W``
371-
When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
372-
any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
373-
With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
374-
not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
375-
this will match anything other than ``[0-9_]`` and characters marked as
376-
alphanumeric in the Unicode character properties database.
384+
Matches any character which is not an alphanumeric character. This is the
385+
opposite of ``\w`` and is therefore similarly subject to the settings of
386+
``ASCII`` and ``LOCALE``.
377387

378388
``\Z``
379389
Matches only at the end of the string.
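The str/bytes distinction described for ``\d`` and ``\s`` above can be sketched as follows. The sample strings are made up: Arabic-Indic digits stand in for "non-ASCII digits" and a no-break space for "non-ASCII whitespace".

```python
import re

digits = "42 \u0664\u0662"   # ASCII "42" followed by Arabic-Indic digits
spaced = "a\u00a0b"          # "a" and "b" separated by a no-break space

# str pattern: Unicode digits match by default, ASCII restricts to [0-9].
unicode_digits = re.findall(r"\d+", digits)
ascii_digits = re.findall(r"\d+", digits, re.ASCII)

# bytes pattern: only the ASCII digits can match.
bytes_digits = re.findall(rb"\d+", digits.encode("utf-8"))

# \s treats the no-break space as whitespace unless ASCII is requested.
unicode_split = re.split(r"\s", spaced)
ascii_split = re.split(r"\s", spaced, flags=re.ASCII)
```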
@@ -454,6 +464,25 @@ form.
454464
expression at a time needn't worry about compiling regular expressions.)
455465
456466
467+
.. data:: A
468+
ASCII
469+
470+
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` perform ASCII-only
471+
matching instead of full Unicode matching. This is only meaningful for
472+
Unicode patterns, and is ignored for byte patterns.
473+
474+
Note that the :const:`re.U` flag still exists (as well as its synonym
475+
:const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it has
476+
become useless in Python 3.0.
477+
In previous Python versions, it was used to specify that
478+
matching had to be Unicode dependent (the default was ASCII matching in
479+
all circumstances). Starting from Python 3.0, the default is Unicode
480+
matching for Unicode strings (which can be changed by specifying the
481+
``'a'`` flag), and ASCII matching for 8-bit strings. Further, Unicode
482+
dependent matching for 8-bit strings isn't allowed anymore and results
483+
in a ValueError.
484+
485+
457486
.. data:: I
458487
IGNORECASE
459488
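The flag rules in the ``ASCII`` entry above can be checked with a short sketch. `compile_error` is a hypothetical helper used only for this illustration.

```python
import re

def compile_error(pattern, flags):
    """Return the exception class name raised by re.compile, or None."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return type(exc).__name__
    return None

# re.A is the short spelling of re.ASCII.
same_flag = re.A == re.ASCII

# re.UNICODE is accepted, but redundant, for str patterns.
str_unicode = compile_error(r"\w+", re.UNICODE)

# Asking for Unicode matching on a bytes pattern is an error.
bytes_unicode = compile_error(rb"\w+", re.UNICODE)

# ASCII is accepted for bytes patterns, where it changes nothing.
bytes_ascii = compile_error(rb"\w+", re.ASCII)
```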

@@ -465,7 +494,10 @@ form.
465494
LOCALE
466495

467496
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
468-
current locale.
497+
current locale. The use of this flag is discouraged as the locale mechanism
498+
is very unreliable, and it only handles one "culture" at a time anyway;
499+
you should use Unicode matching instead, which is the default in Python 3.0
500+
for Unicode (str) patterns.
469501

470502

471503
.. data:: M
@@ -486,13 +518,6 @@ form.
486518
newline; without this flag, ``'.'`` will match anything *except* a newline.
487519

488520

489-
.. data:: U
490-
UNICODE
491-
492-
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
493-
on the Unicode character properties database.
494-
495-
496521
.. data:: X
497522
VERBOSE
498523

@@ -511,6 +536,8 @@ form.
511536
b = re.compile(r"\d+\.\d*")
512537

513538

539+
540+
514541
.. function:: search(pattern, string[, flags])
515542

516543
Scan through *string* looking for a location where the regular expression

Lib/_strptime.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
import locale
1515
import calendar
1616
from re import compile as re_compile
17-
from re import IGNORECASE
17+
from re import IGNORECASE, ASCII
1818
from re import escape as re_escape
1919
from datetime import date as datetime_date
2020
try:
@@ -262,7 +262,7 @@ def pattern(self, format):
262262

263263
def compile(self, format):
264264
"""Return a compiled re object for the format string."""
265-
return re_compile(self.pattern(format), IGNORECASE)
265+
return re_compile(self.pattern(format), IGNORECASE | ASCII)
266266

267267
_cache_lock = _thread_allocate_lock()
268268
# DO NOT modify _TimeRE_cache or _regex_cache without acquiring the cache lock

Lib/base64.py

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -39,7 +39,7 @@ def _translate(s, altchars):
3939
return s.translate(translation)
4040

4141

42-
42+
4343
# Base64 encoding/decoding uses binascii
4444

4545
def b64encode(s, altchars=None):
@@ -126,7 +126,7 @@ def urlsafe_b64decode(s):
126126
return b64decode(s, b'-_')
127127

128128

129-
129+
130130
# Base32 encoding/decoding must be done in Python
131131
_b32alphabet = {
132132
0: b'A', 9: b'J', 18: b'S', 27: b'3',
@@ -225,7 +225,7 @@ def b32decode(s, casefold=False, map01=None):
225225
# characters because this will tell us how many null bytes to remove from
226226
# the end of the decoded string.
227227
padchars = 0
228-
mo = re.search('(?P<pad>[=]*)$', s)
228+
mo = re.search(b'(?P<pad>[=]*)$', s)
229229
if mo:
230230
padchars = len(mo.group('pad'))
231231
if padchars > 0:
@@ -262,7 +262,7 @@ def b32decode(s, casefold=False, map01=None):
262262
return b''.join(parts)
263263

264264

265-
265+
266266
# RFC 3548, Base 16 Alphabet specifies uppercase, but hexlify() returns
267267
# lowercase. The RFC also recommends against accepting input case
268268
# insensitively.
@@ -291,12 +291,12 @@ def b16decode(s, casefold=False):
291291
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
292292
if casefold:
293293
s = s.upper()
294-
if re.search('[^0-9A-F]', s):
294+
if re.search(b'[^0-9A-F]', s):
295295
raise binascii.Error('Non-base16 digit found')
296296
return binascii.unhexlify(s)
297297

298298

299-
299+
300300
# Legacy interface. This code could be cleaned up since I don't believe
301301
# binascii has any line length limitations. It just doesn't seem worth it
302302
# though. The files should be opened in binary mode.
@@ -353,7 +353,7 @@ def decodestring(s):
353353
return binascii.a2b_base64(s)
354354

355355

356-
356+
357357
# Usable as a script...
358358
def main():
359359
"""Small main program"""
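Because `s` is a bytes object in these functions, the patterns have to be bytes as well. A minimal sketch of the two checks touched by the hunks above, using hypothetical helper names (`count_padding`, `has_non_base16`) and made-up inputs; it is not the module's actual code path.

```python
import re

def count_padding(data: bytes) -> int:
    """Count the trailing '=' padding characters of a base32 payload."""
    match = re.search(b'(?P<pad>[=]*)$', data)
    return len(match.group('pad')) if match else 0

def has_non_base16(data: bytes) -> bool:
    """Report whether data contains anything outside the uppercase hex set."""
    return re.search(b'[^0-9A-F]', data) is not None

padding = count_padding(b'MZXW6===')
no_padding = count_padding(b'MZXW6YTB')
clean_hex = has_non_base16(b'48656C6C6F')
lowercase_hex = has_non_base16(b'48656c')
```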

Lib/decimal.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5415,7 +5415,7 @@ def _convert_other(other, raiseit=False):
54155415
# 2. For finite numbers (not infinities and NaNs) the body of the
54165416
# number between the optional sign and the optional exponent must have
54175417
# at least one decimal digit, possibly after the decimal point. The
5418-
# lookahead expression '(?=\d|\.\d)' checks this.
5418+
# lookahead expression '(?=[0-9]|\.[0-9])' checks this.
54195419
#
54205420
# As the flag UNICODE is not enabled here, we're explicitly avoiding any
54215421
# other meaning for \d than the numbers [0-9].

Lib/distutils/cygwinccompiler.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -409,7 +409,7 @@ def get_versions():
409409
out = os.popen(gcc_exe + ' -dumpversion','r')
410410
out_string = out.read()
411411
out.close()
412-
result = re.search('(\d+\.\d+(\.\d+)*)',out_string)
412+
result = re.search('(\d+\.\d+(\.\d+)*)', out_string, re.ASCII)
413413
if result:
414414
gcc_version = StrictVersion(result.group(1))
415415
else:
@@ -421,7 +421,7 @@ def get_versions():
421421
out = os.popen(ld_exe + ' -v','r')
422422
out_string = out.read()
423423
out.close()
424-
result = re.search('(\d+\.\d+(\.\d+)*)',out_string)
424+
result = re.search('(\d+\.\d+(\.\d+)*)', out_string, re.ASCII)
425425
if result:
426426
ld_version = StrictVersion(result.group(1))
427427
else:
@@ -433,7 +433,7 @@ def get_versions():
433433
out = os.popen(dllwrap_exe + ' --version','r')
434434
out_string = out.read()
435435
out.close()
436-
result = re.search(' (\d+\.\d+(\.\d+)*)',out_string)
436+
result = re.search(' (\d+\.\d+(\.\d+)*)', out_string, re.ASCII)
437437
if result:
438438
dllwrap_version = StrictVersion(result.group(1))
439439
else:

Lib/distutils/emxccompiler.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -300,7 +300,7 @@ def get_versions():
300300
out = os.popen(gcc_exe + ' -dumpversion','r')
301301
out_string = out.read()
302302
out.close()
303-
result = re.search('(\d+\.\d+\.\d+)',out_string)
303+
result = re.search('(\d+\.\d+\.\d+)', out_string, re.ASCII)
304304
if result:
305305
gcc_version = StrictVersion(result.group(1))
306306
else:

Lib/distutils/sysconfig.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -512,7 +512,7 @@ def get_config_vars(*args):
512512
# patched up as well.
513513
'CFLAGS', 'PY_CFLAGS', 'BLDSHARED'):
514514
flags = _config_vars[key]
515-
flags = re.sub('-arch\s+\w+\s', ' ', flags)
515+
flags = re.sub('-arch\s+\w+\s', ' ', flags, re.ASCII)
516516
flags = re.sub('-isysroot [^ \t]*', ' ', flags)
517517
_config_vars[key] = flags
518518
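In the current signature `re.sub(pattern, repl, string, count=0, flags=0)`, the fourth positional argument is `count`, so a flag passed positionally in that slot is read as a substitution limit rather than as a flag. A hedged sketch of one way to get ASCII matching for a substitution is to compile the pattern with the flag first; `_ARCH_RE` and `strip_arch` are hypothetical names and the flag string is made up.

```python
import re

# Compile once with the flag, then substitute; this avoids relying on the
# positional order of re.sub()'s optional arguments.
_ARCH_RE = re.compile(r'-arch\s+\w+\s', re.ASCII)

def strip_arch(flags: str) -> str:
    """Replace every '-arch NAME' pair (and its trailing space) with a space."""
    return _ARCH_RE.sub(' ', flags)

cleaned = strip_arch('-O2 -arch i386 -arch ppc -g')
```

An inline `(?a)` at the start of the pattern string would be another way to attach the flag without a separate compile step.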

Lib/distutils/util.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,7 @@ def get_platform ():
8181
return "%s-%s.%s" % (osname, version, release)
8282
elif osname[:6] == "cygwin":
8383
osname = "cygwin"
84-
rel_re = re.compile (r'[\d.]+')
84+
rel_re = re.compile (r'[\d.]+', re.ASCII)
8585
m = rel_re.match(release)
8686
if m:
8787
release = m.group()

Lib/distutils/version.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -134,7 +134,7 @@ class StrictVersion (Version):
134134
"""
135135

136136
version_re = re.compile(r'^(\d+) \. (\d+) (\. (\d+))? ([ab](\d+))?$',
137-
re.VERBOSE)
137+
re.VERBOSE | re.ASCII)
138138

139139

140140
def parse (self, vstring):
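The effect of adding `re.ASCII` to a version pattern like the one above can be sketched as follows. The pattern is copied from the hunk; the names `ascii_version_re` and `unicode_version_re` and the sample inputs are assumptions for illustration.

```python
import re

PATTERN = r'^(\d+) \. (\d+) (\. (\d+))? ([ab](\d+))?$'

# With re.ASCII, \d accepts only 0-9.
ascii_version_re = re.compile(PATTERN, re.VERBOSE | re.ASCII)

# Without it, a str pattern also accepts other Unicode decimal digits.
unicode_version_re = re.compile(PATTERN, re.VERBOSE)

plain = "1.2.3"
arabic_indic = "\u0661.\u0662"  # Arabic-Indic digits one and two

plain_ok = ascii_version_re.match(plain) is not None
arabic_ascii = ascii_version_re.match(arabic_indic) is not None
arabic_unicode = unicode_version_re.match(arabic_indic) is not None
```

Restricting the pattern to ASCII keeps version strings limited to the digits that the rest of the version-handling code presumably expects.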

Lib/distutils/versionpredicate.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,8 @@
55
import operator
66

77

8-
re_validPackage = re.compile(r"(?i)^\s*([a-z_]\w*(?:\.[a-z_]\w*)*)(.*)")
8+
re_validPackage = re.compile(r"(?i)^\s*([a-z_]\w*(?:\.[a-z_]\w*)*)(.*)",
9+
re.ASCII)
910
# (package) (rest)
1011

1112
re_paren = re.compile(r"^\s*\((.*)\)\s*$") # (list) inside of parentheses
@@ -153,7 +154,8 @@ def split_provision(value):
153154
global _provision_rx
154155
if _provision_rx is None:
155156
_provision_rx = re.compile(
156-
"([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*)(?:\s*\(\s*([^)\s]+)\s*\))?$")
157+
"([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*)(?:\s*\(\s*([^)\s]+)\s*\))?$",
158+
re.ASCII)
157159
value = value.strip()
158160
m = _provision_rx.match(value)
159161
if not m:
