gh-152248: Reject a POSIX TZ abbreviation with non-ASCII-letter characters in pure-Python zoneinfo #152249

Open
tonghuaroot wants to merge 5 commits into python:main from tonghuaroot:gh-152248-zoneinfo-abbr-charset

@tonghuaroot tonghuaroot commented Jun 26, 2026

Contributor

The pure-Python zoneinfo parser accepts a POSIX TZ string whose unquoted std/dst abbreviation contains characters other than ASCII letters (for example an embedded space or a non-ASCII letter), while the C implementation rejects it. The unquoted alternative in the parser regex is a negated class, [^<0-9:.+-]+, that admits anything except a few delimiters. The C parse_abbr instead walks the unquoted form with Py_ISALPHA (ASCII letters only), which is what POSIX (via RFC 8536) requires.

This tightens the unquoted alternative to [a-zA-Z]+, matching the C accelerator and POSIX, and leaves the quoted <...> form untouched. Every well-formed TZ string and all bundled IANA zones still parse unchanged; only the malformed strings that the pure-Python parser previously accepted now raise ValueError.
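
To make the difference concrete, here is a minimal sketch that compares the two `std` alternatives in isolation. The names `OLD_ABBR`, `NEW_ABBR`, and `accepted` are hypothetical, and the patterns are only the abbreviation fragment quoted in this PR, not the full parser regex in Lib/zoneinfo/_zoneinfo.py.

```python
import re

# Isolated copies of the old and new `std` alternatives from this PR.
# The real parser regex has further groups (offsets, DST rules) not shown here.
OLD_ABBR = re.compile(r"[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>")
NEW_ABBR = re.compile(r"[a-zA-Z]+|<[a-zA-Z0-9+-]+>")


def accepted(pattern, abbr):
    """Return True if the whole abbreviation matches the pattern."""
    return pattern.fullmatch(abbr) is not None


samples = ["EST", "<+03>", "A B", "ABÀC", "<>"]

# Map each sample to (accepted by old pattern, accepted by new pattern).
results = {s: (accepted(OLD_ABBR, s), accepted(NEW_ABBR, s)) for s in samples}

for abbr, (old, new) in results.items():
    print(f"{abbr!r}: old={old} new={new}")
```

Under these assumptions, the well-formed abbreviations behave the same with both patterns, while the embedded-space and non-ASCII samples are accepted only by the old negated class.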

The non-ASCII case is reachable through the public from_file path, which UTF-8-decodes the footer, so it is covered by a dedicated regression test in addition to the whitespace cases added to the shared invalid_tzstrs list.
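
The following sketch illustrates why the UTF-8 decoding step matters. It does not call zoneinfo at all; it only shows that, once a footer is decoded to text, a Unicode-aware letter check and an ASCII-only check (the behaviour attributed to Py_ISALPHA above) disagree on the same abbreviation. The variable names are illustrative only.

```python
tzstr = "ABÀC3"
footer = tzstr.encode("utf-8")    # bytes, as they would appear in a footer
decoded = footer.decode("utf-8")  # text, as a str-based parser would see it

abbr = decoded[:4]  # the would-be abbreviation, "ABÀC"

# str.isalpha() is Unicode-aware, so the accented letter counts as a letter.
unicode_alpha = abbr.isalpha()

# An ASCII-only check rejects it, matching the "ASCII letters only" rule.
ascii_alpha = abbr.isascii() and abbr.isalpha()

print(unicode_alpha, ascii_alpha)
```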

Comment thread Lib/test/test_zoneinfo/test_zoneinfo.py Outdated
tzstr = "ABÀC3"
footer = tzstr.encode("utf-8")

def from_footer():

Member

We can give zone_from_tzstr a new parameter for encoding rather than duplicating.

Contributor Author

Done, added an encoding parameter to zone_from_tzstr and reused it. I kept this a separate method only because the C and pure errors differ (bytes repr vs decoded text), so each is matched against its own message.

Comment thread Lib/zoneinfo/_zoneinfo.py
parser_re = re.compile(
r"""
- (?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)
+ (?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)

Member

And I see another divergence: C accepts an empty <>. :'-(

Contributor Author

Good catch. The direction is the reverse of this PR though: here C is the lenient side. Its parse_abbr quoted branch has no empty check, while its own unquoted branch rejects an empty run (if (str_end == str_start) return -1;), so the pure parser is correct. Want me to fold a small C fix in here, or open a separate issue?

@StanFromIreland StanFromIreland Jun 26, 2026

Member

Please add it here, it's in the scope of POSIX TZ strings. This is actually spelled out by recent versions of the standard:

> the quoting characters do not contribute to the three byte minimum length and {TZNAME_MAX} maximum length.

Contributor Author

Done. The C parser now rejects an empty <>, mirroring its unquoted branch.
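
For readers following along, here is a rough Python model of the abbreviation rules as discussed in this thread, with the empty check applied to both branches. It is a hypothetical sketch, not the CPython C code: the function name reuses parse_abbr only for readability, and the return shape and error messages are assumptions.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)
# Characters allowed inside <...>, per the quoted alternative in the regex.
QUOTED_CHARS = ASCII_LETTERS | set(string.digits) | {"+", "-"}


def parse_abbr(tzstr):
    """Split a TZ string into (abbreviation, remainder).

    Hypothetical model of the rules discussed above; raises ValueError
    when the abbreviation is empty or malformed.
    """
    if tzstr.startswith("<"):
        end = tzstr.find(">")
        if end == -1:
            raise ValueError("unterminated quoted abbreviation")
        abbr = tzstr[1:end]
        if not abbr:
            # The check this thread adds for the quoted form: reject "<>".
            raise ValueError("empty quoted abbreviation")
        if any(ch not in QUOTED_CHARS for ch in abbr):
            raise ValueError("invalid character in quoted abbreviation")
        return abbr, tzstr[end + 1:]

    # Unquoted form: walk ASCII letters only.
    end = 0
    while end < len(tzstr) and tzstr[end] in ASCII_LETTERS:
        end += 1
    if end == 0:
        # Mirrors the existing empty-run rejection in the unquoted branch.
        raise ValueError("empty unquoted abbreviation")
    return tzstr[:end], tzstr[end:]


print(parse_abbr("EST5EDT"))
print(parse_abbr("<+03>-3"))
```

In this model an input such as "ABÀC3" stops the unquoted walk at the first non-ASCII character, so the remainder would then fail whatever offset parsing follows.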
