Commit 166ebc4

Issue python#19676: Added the "namereplace" error handler.
1 parent 6cecf68 commit 166ebc4

11 files changed, with 255 additions and 7 deletions

Doc/c-api/codec.rst

Lines changed: 5 additions & 0 deletions
@@ -116,3 +116,8 @@ Registry API for Unicode encoding error handlers
    Replace the unicode encode error with backslash escapes (``\x``, ``\u`` and
    ``\U``).
 
+.. c:function:: PyObject* PyCodec_NameReplaceErrors(PyObject *exc)
+
+   Replace the unicode encode error with ``\N{...}`` escapes.
+
+   .. versionadded:: 3.4

Doc/howto/unicode.rst

Lines changed: 5 additions & 2 deletions
@@ -325,8 +325,9 @@ The *errors* parameter is the same as the parameter of the
 :meth:`~bytes.decode` method but supports a few more possible handlers. As well as
 ``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
 inserts a question mark instead of the unencodable character), there is
-also ``'xmlcharrefreplace'`` (inserts an XML character reference) and
-``backslashreplace`` (inserts a ``\uNNNN`` escape sequence).
+also ``'xmlcharrefreplace'`` (inserts an XML character reference),
+``backslashreplace`` (inserts a ``\uNNNN`` escape sequence) and
+``namereplace`` (inserts a ``\N{...}`` escape sequence).
 
 The following example shows the different results::
 
@@ -346,6 +347,8 @@ The following example shows the different results::
     b'&#40960;abcd&#1972;'
     >>> u.encode('ascii', 'backslashreplace')
     b'\\ua000abcd\\u07b4'
+    >>> u.encode('ascii', 'namereplace')
+    b'\\N{YI SYLLABLE IT}abcd\\u07b4'
 
 The low-level routines for registering and accessing the available
 encodings are found in the :mod:`codecs` module. Implementing new
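
The difference between the two escaping handlers documented above can be sketched as follows. This is a minimal illustration, assuming an interpreter that ships the ``namereplace`` handler; the sample string and variable names are made up for the example. It also shows that the output can be read back with the ``unicode_escape`` codec, which only round-trips cleanly when the original text contains no backslashes of its own.

```python
# Sample text (hypothetical): one BMP symbol and one Latin-1 letter.
text = "price: 10\u20ac, caf\xe9"

# backslashreplace emits numeric escapes; namereplace prefers \N{...}.
by_code = text.encode("ascii", "backslashreplace")
by_name = text.encode("ascii", "namereplace")

print(by_code)  # b'price: 10\\u20ac, caf\\xe9'
print(by_name)  # b'price: 10\\N{EURO SIGN}, caf\\N{LATIN SMALL LETTER E WITH ACUTE}'

# The unicode_escape codec understands \N{...}, so the text can be recovered
# here because the sample contains no literal backslashes.
restored = by_name.decode("unicode_escape")
print(restored == text)  # True
```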

Doc/library/codecs.rst

Lines changed: 17 additions & 0 deletions
@@ -98,6 +98,8 @@ It defines the following functions:
      reference (for encoding only)
    * ``'backslashreplace'``: replace with backslashed escape sequences (for
      encoding only)
+   * ``'namereplace'``: replace with ``\N{...}`` escape sequences (for
+     encoding only)
    * ``'surrogateescape'``: on decoding, replace with code points in the Unicode
      Private Use Area ranging from U+DC80 to U+DCFF. These private code
      points will then be turned back into the same bytes when the
@@ -232,6 +234,11 @@ functions which use :func:`lookup` for the codec lookup:
    Implements the ``backslashreplace`` error handling (for encoding only): the
    unencodable character is replaced by a backslashed escape sequence.
 
+.. function:: namereplace_errors(exception)
+
+   Implements the ``namereplace`` error handling (for encoding only): the
+   unencodable character is replaced by a ``\N{...}`` escape sequence.
+
 To simplify working with encoded files or stream, the module also defines these
 utility functions:
 
@@ -363,6 +370,9 @@ and implemented by all standard Python codecs:
 | ``'backslashreplace'``  | Replace with backslashed escape sequences     |
 |                         | (only for encoding).                          |
 +-------------------------+-----------------------------------------------+
+| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
+|                         | (only for encoding).                          |
++-------------------------+-----------------------------------------------+
 | ``'surrogateescape'``   | Replace byte with surrogate U+DCxx, as defined|
 |                         | in :pep:`383`.                                |
 +-------------------------+-----------------------------------------------+
@@ -384,6 +394,9 @@ schemes:
 .. versionchanged:: 3.4
    The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
 
+.. versionadded:: 3.4
+   The ``'namereplace'`` error handler.
+
 The set of allowed values can be extended via :meth:`register_error`.
 
 
@@ -477,6 +490,8 @@ define in order to be compatible with the Python codec registry.
 
    * ``'backslashreplace'`` Replace with backslashed escape sequences.
 
+   * ``'namereplace'`` Replace with ``\N{...}`` escape sequences.
+
    The *errors* argument will be assigned to an attribute of the same name.
    Assigning to this attribute makes it possible to switch between different error
    handling strategies during the lifetime of the :class:`IncrementalEncoder`
@@ -625,6 +640,8 @@ compatible with the Python codec registry.
 
    * ``'backslashreplace'`` Replace with backslashed escape sequences.
 
+   * ``'namereplace'`` Replace with ``\N{...}`` escape sequences.
+
    The *errors* argument will be assigned to an attribute of the same name.
    Assigning to this attribute makes it possible to switch between different error
    handling strategies during the lifetime of the :class:`StreamWriter` object.
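
The documented handler function and the *errors* attribute of an incremental encoder can be exercised directly. The sketch below is illustrative only and assumes an interpreter where ``codecs.namereplace_errors`` is available; the exception reason string and variable names are placeholders.

```python
import codecs

# Call the handler the way a codec would: with a UnicodeEncodeError that
# marks the unencodable slice (here, position 0 to 1 of the input).
exc = UnicodeEncodeError("ascii", "\u3042", 0, 1, "ordinal not in range(128)")
replacement, resume_at = codecs.namereplace_errors(exc)
print(replacement, resume_at)  # \N{HIRAGANA LETTER A} 1

# The errors attribute of an incremental encoder can be reassigned between
# calls to switch strategies mid-stream.
encoder = codecs.getincrementalencoder("ascii")(errors="namereplace")
first = encoder.encode("\xff")
encoder.errors = "backslashreplace"
second = encoder.encode("\xff", final=True)
print(first)   # b'\\N{LATIN SMALL LETTER Y WITH DIAERESIS}'
print(second)  # b'\\xff'
```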

Doc/library/functions.rst

Lines changed: 3 additions & 0 deletions
@@ -975,6 +975,9 @@ are always available. They are listed here in alphabetical order.
      replaces unsupported characters with Python's backslashed escape
      sequences.
 
+   * ``'namereplace'`` (also only supported when writing)
+     replaces unsupported characters with ``\N{...}`` escape sequences.
+
    .. index::
       single: universal newlines; open() built-in function
 
980983

Doc/library/io.rst

Lines changed: 4 additions & 3 deletions
@@ -827,9 +827,10 @@ Text I/O
    errors can lead to data loss.) ``'replace'`` causes a replacement marker
    (such as ``'?'``) to be inserted where there is malformed data. When
    writing, ``'xmlcharrefreplace'`` (replace with the appropriate XML character
-   reference) or ``'backslashreplace'`` (replace with backslashed escape
-   sequences) can be used. Any other error handling name that has been
-   registered with :func:`codecs.register_error` is also valid.
+   reference), ``'backslashreplace'`` (replace with backslashed escape
+   sequences) or ``'namereplace'`` (replace with ``\N{...}`` escape sequences)
+   can be used. Any other error handling name that has been registered with
+   :func:`codecs.register_error` is also valid.
 
    .. index::
       single: universal newlines; io.TextIOWrapper class
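
For text streams, the handler is selected through the *errors* argument when the stream is created. The following is a small sketch using an in-memory buffer in place of a real file; it assumes the ``namereplace`` handler is registered in the running interpreter, and the sample text is arbitrary.

```python
import io

# An in-memory byte buffer stands in for a file opened in binary mode.
raw = io.BytesIO()

# newline="\n" disables newline translation so the bytes are predictable.
with io.TextIOWrapper(raw, encoding="ascii", errors="namereplace",
                      newline="\n") as stream:
    stream.write("snow: \u2603\n")
    stream.flush()
    # Read the buffer before the wrapper closes it on exit.
    written = raw.getvalue()

print(written)  # b'snow: \\N{SNOWMAN}\n'
```

The same *errors* value can be passed to the built-in ``open()`` when writing a real file.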

Include/codecs.h

Lines changed: 3 additions & 0 deletions
@@ -225,6 +225,9 @@ PyAPI_FUNC(PyObject *) PyCodec_XMLCharRefReplaceErrors(PyObject *exc);
 /* replace the unicode encode error with backslash escapes (\x, \u and \U) */
 PyAPI_FUNC(PyObject *) PyCodec_BackslashReplaceErrors(PyObject *exc);
 
+/* replace the unicode encode error with backslash escapes (\N, \x, \u and \U) */
+PyAPI_FUNC(PyObject *) PyCodec_NameReplaceErrors(PyObject *exc);
+
 PyAPI_DATA(const char *) Py_hexdigits;
 
 #ifdef __cplusplus

Lib/codecs.py

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,7 @@
            "BOM_UTF32", "BOM_UTF32_LE", "BOM_UTF32_BE",
            "strict_errors", "ignore_errors", "replace_errors",
            "xmlcharrefreplace_errors",
+           "backslashreplace_errors", "namereplace_errors",
            "register_error", "lookup_error"]
 
 ### Constants
@@ -1085,13 +1086,15 @@ def make_encoding_map(decoding_map):
     replace_errors = lookup_error("replace")
     xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
     backslashreplace_errors = lookup_error("backslashreplace")
+    namereplace_errors = lookup_error("namereplace")
 except LookupError:
     # In --disable-unicode builds, these error handler are missing
     strict_errors = None
     ignore_errors = None
     replace_errors = None
     xmlcharrefreplace_errors = None
     backslashreplace_errors = None
+    namereplace_errors = None
 
 # Tell modulefinder that using codecs probably needs the encodings
 # package

Lib/test/test_codeccallbacks.py

Lines changed: 98 additions & 2 deletions
@@ -158,6 +158,22 @@ def test_backslashescape(self):
         sout = b"a\xac\\u1234\xa4\\u8000\\U0010ffff"
         self.assertEqual(sin.encode("iso-8859-15", "backslashreplace"), sout)
 
+    def test_nameescape(self):
+        # Does the same as backslashescape, but prefers ``\N{...}`` escape
+        # sequences.
+        sin = "a\xac\u1234\u20ac\u8000\U0010ffff"
+        sout = (b'a\\N{NOT SIGN}\\N{ETHIOPIC SYLLABLE SEE}\\N{EURO SIGN}'
+                b'\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff')
+        self.assertEqual(sin.encode("ascii", "namereplace"), sout)
+
+        sout = (b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\\N{EURO SIGN}'
+                b'\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff')
+        self.assertEqual(sin.encode("latin-1", "namereplace"), sout)
+
+        sout = (b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\xa4'
+                b'\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff')
+        self.assertEqual(sin.encode("iso-8859-15", "namereplace"), sout)
+
     def test_decoding_callbacks(self):
         # This is a test for a decoding callback handler
         # that allows the decoding of the invalid sequence
@@ -297,7 +313,7 @@ def handler2(exc):
     def test_longstrings(self):
         # test long strings to check for memory overflow problems
         errors = [ "strict", "ignore", "replace", "xmlcharrefreplace",
-                   "backslashreplace"]
+                   "backslashreplace", "namereplace"]
         # register the handlers under different names,
         # to prevent the codec from recognizing the name
         for err in errors:
@@ -611,6 +627,81 @@ def test_badandgoodbackslashreplaceexceptions(self):
                 ("\\udfff", 1)
             )
 
+    def test_badandgoodnamereplaceexceptions(self):
+        # "namereplace" complains about a non-exception passed in
+        self.assertRaises(
+            TypeError,
+            codecs.namereplace_errors,
+            42
+        )
+        # "namereplace" complains about the wrong exception types
+        self.assertRaises(
+            TypeError,
+            codecs.namereplace_errors,
+            UnicodeError("ouch")
+        )
+        # "namereplace" can only be used for encoding
+        self.assertRaises(
+            TypeError,
+            codecs.namereplace_errors,
+            UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")
+        )
+        self.assertRaises(
+            TypeError,
+            codecs.namereplace_errors,
+            UnicodeTranslateError("\u3042", 0, 1, "ouch")
+        )
+        # Use the correct exception
+        self.assertEqual(
+            codecs.namereplace_errors(
+                UnicodeEncodeError("ascii", "\u3042", 0, 1, "ouch")),
+            ("\\N{HIRAGANA LETTER A}", 1)
+        )
+        self.assertEqual(
+            codecs.namereplace_errors(
+                UnicodeEncodeError("ascii", "\x00", 0, 1, "ouch")),
+            ("\\x00", 1)
+        )
+        self.assertEqual(
+            codecs.namereplace_errors(
+                UnicodeEncodeError("ascii", "\xff", 0, 1, "ouch")),
+            ("\\N{LATIN SMALL LETTER Y WITH DIAERESIS}", 1)
+        )
+        self.assertEqual(
+            codecs.namereplace_errors(
+                UnicodeEncodeError("ascii", "\u0100", 0, 1, "ouch")),
+            ("\\N{LATIN CAPITAL LETTER A WITH MACRON}", 1)
+        )
+        self.assertEqual(
+            codecs.namereplace_errors(
+                UnicodeEncodeError("ascii", "\uffff", 0, 1, "ouch")),
+            ("\\uffff", 1)
+        )
+        if SIZEOF_WCHAR_T > 0:
+            self.assertEqual(
+                codecs.namereplace_errors(
+                    UnicodeEncodeError("ascii", "\U00010000",
+                                       0, 1, "ouch")),
+                ("\\N{LINEAR B SYLLABLE B008 A}", 1)
+            )
+            self.assertEqual(
+                codecs.namereplace_errors(
+                    UnicodeEncodeError("ascii", "\U0010ffff",
+                                       0, 1, "ouch")),
+                ("\\U0010ffff", 1)
+            )
+            # Lone surrogates (regardless of unicode width)
+            self.assertEqual(
+                codecs.namereplace_errors(
+                    UnicodeEncodeError("ascii", "\ud800", 0, 1, "ouch")),
+                ("\\ud800", 1)
+            )
+            self.assertEqual(
+                codecs.namereplace_errors(
+                    UnicodeEncodeError("ascii", "\udfff", 0, 1, "ouch")),
+                ("\\udfff", 1)
+            )
+
     def test_badhandlerresults(self):
         results = ( 42, "foo", (1,2,3), ("foo", 1, 3), ("foo", None), ("foo",), ("foo", 1, 3), ("foo", None), ("foo",) )
         encs = ("ascii", "latin-1", "iso-8859-1", "iso-8859-15")
@@ -651,6 +742,10 @@ def test_lookup(self):
             codecs.backslashreplace_errors,
             codecs.lookup_error("backslashreplace")
         )
+        self.assertEqual(
+            codecs.namereplace_errors,
+            codecs.lookup_error("namereplace")
+        )
 
     def test_unencodablereplacement(self):
         def unencrepl(exc):
@@ -804,7 +899,8 @@ def badencodereturn2(exc):
         class D(dict):
             def __getitem__(self, key):
                 raise ValueError
-        for err in ("strict", "replace", "xmlcharrefreplace", "backslashreplace", "test.posreturn"):
+        for err in ("strict", "replace", "xmlcharrefreplace",
+                    "backslashreplace", "namereplace", "test.posreturn"):
             self.assertRaises(UnicodeError, codecs.charmap_encode, "\xff", err, {0xff: None})
             self.assertRaises(ValueError, codecs.charmap_encode, "\xff", err, D())
             self.assertRaises(TypeError, codecs.charmap_encode, "\xff", err, {0xff: 300})
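
The fallback behaviour these tests pin down can be reproduced outside the test suite. The sketch below is a hedged illustration, not part of the commit: ``replacement_for`` and the alias ``demo.namereplace`` are hypothetical names chosen for this example. It shows that code points without a character name fall back to numeric escapes, and that the handler still works when registered under another name, as ``test_longstrings`` does.

```python
import codecs


def replacement_for(ch):
    """Return the text namereplace substitutes for a single character."""
    exc = UnicodeEncodeError("ascii", ch, 0, 1, "demo")
    return codecs.namereplace_errors(exc)[0]


named = replacement_for("\u0100")               # has a Unicode name
unnamed_bmp = replacement_for("\uffff")         # noncharacter, no name
unnamed_astral = replacement_for("\U0010ffff")  # noncharacter, no name
lone_surrogate = replacement_for("\ud800")      # surrogates have no name

print(named)           # \N{LATIN CAPITAL LETTER A WITH MACRON}
print(unnamed_bmp)     # \uffff
print(unnamed_astral)  # \U0010ffff
print(lone_surrogate)  # \ud800

# Register the same handler under a hypothetical alias; the codec cannot
# recognise the name, so it goes through the generic callback path.
codecs.register_error("demo.namereplace", codecs.namereplace_errors)
aliased = "a\u20ac".encode("ascii", "demo.namereplace")
print(aliased)  # b'a\\N{EURO SIGN}'
```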

Lib/test/test_codecs.py

Lines changed: 7 additions & 0 deletions
@@ -349,6 +349,8 @@ def test_lone_surrogates(self):
         self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
         self.assertEqual("[\uDC80]".encode(self.encoding, "backslashreplace"),
                          "[\\udc80]".encode(self.encoding))
+        self.assertEqual("[\uDC80]".encode(self.encoding, "namereplace"),
+                         "[\\udc80]".encode(self.encoding))
         self.assertEqual("[\uDC80]".encode(self.encoding, "xmlcharrefreplace"),
                          "[&#56448;]".encode(self.encoding))
         self.assertEqual("[\uDC80]".encode(self.encoding, "ignore"),
@@ -808,6 +810,7 @@ def test_encode(self):
                 ('\udc80', 'ignore', b''),
                 ('\udc80', 'replace', b'?'),
                 ('\udc80', 'backslashreplace', b'\\udc80'),
+                ('\udc80', 'namereplace', b'\\udc80'),
                 ('\udc80', 'surrogatepass', b'\xed\xb2\x80'),
             ))
         else:
@@ -869,6 +872,8 @@ def test_lone_surrogates(self):
         self.assertRaises(UnicodeDecodeError, b"\xed\xa0\x80".decode, "cp65001")
         self.assertEqual("[\uDC80]".encode("cp65001", "backslashreplace"),
                          b'[\\udc80]')
+        self.assertEqual("[\uDC80]".encode("cp65001", "namereplace"),
+                         b'[\\udc80]')
         self.assertEqual("[\uDC80]".encode("cp65001", "xmlcharrefreplace"),
                          b'[&#56448;]')
         self.assertEqual("[\uDC80]".encode("cp65001", "surrogateescape"),
@@ -2824,6 +2829,8 @@ def test_cp932(self):
             ('[\xff]', 'replace', b'[y]'),
             ('[\u20ac]', 'replace', b'[?]'),
             ('[\xff]', 'backslashreplace', b'[\\xff]'),
+            ('[\xff]', 'namereplace',
+             b'[\\N{LATIN SMALL LETTER Y WITH DIAERESIS}]'),
             ('[\xff]', 'xmlcharrefreplace', b'[&#255;]'),
             ('\udcff', 'strict', None),
             ('[\udcff]', 'surrogateescape', b'[\xff]'),
Misc/NEWS

Lines changed: 2 additions & 0 deletions
@@ -191,6 +191,8 @@ Core and Builtins
 Library
 -------
 
+- Issue #19676: Added the "namereplace" error handler.
+
 - Issue #22788: Add *context* parameter to logging.handlers.HTTPHandler.
 
 - Issue #22921: Allow SSLContext to take the *hostname* parameter even if
