gh-67041: Allow to distinguish between empty and not defined URI components #123305
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -50,12 +50,16 @@ URL Parsing | |
| The URL parsing functions focus on splitting a URL string into its components, | ||
| or on combining URL components into a URL string. | ||
|
|
||
| .. function:: urlparse(urlstring, scheme='', allow_fragments=True) | ||
| .. function:: urlparse(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False) | ||
|
|
||
| Parse a URL into six components, returning a 6-item :term:`named tuple`. This | ||
| corresponds to the general structure of a URL: | ||
| ``scheme://netloc/path;parameters?query#fragment``. | ||
| Each tuple item is a string, possibly empty. The components are not broken up | ||
| Each tuple item is a string, possibly empty, or ``None`` if | ||
| *missing_as_none* is true. | ||
| Undefined components are represented as an empty string (by default) or | ||
| ``None`` if *missing_as_none* is true. | ||
| The components are not broken up | ||
| into smaller parts (for example, the network location is a single string), and % | ||
| escapes are not expanded. The delimiters as shown above are not part of the | ||
| result, except for a leading slash in the *path* component, which is retained if | ||
|
|
@@ -84,6 +88,12 @@ or on combining URL components into a URL string. | |
| 80 | ||
| >>> o._replace(fragment="").geturl() | ||
| 'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params' | ||
| >>> urlparse("http://docs.python.org?") | ||
| ParseResult(scheme='http', netloc='docs.python.org', | ||
| path='', params='', query='', fragment='') | ||
| >>> urlparse("http://docs.python.org?", missing_as_none=True) | ||
| ParseResult(scheme='http', netloc='docs.python.org', | ||
| path='', params=None, query='', fragment=None) | ||
|
|
||
| Following the syntax specifications in :rfc:`1808`, urlparse recognizes | ||
| a netloc only if it is properly introduced by '//'. Otherwise the | ||
|
|
@@ -101,47 +111,53 @@ or on combining URL components into a URL string. | |
| ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', | ||
| params='', query='', fragment='') | ||
| >>> urlparse('help/Python.html') | ||
| ParseResult(scheme='', netloc='', path='help/Python.html', params='', | ||
| query='', fragment='') | ||
| ParseResult(scheme='', netloc='', path='help/Python.html', | ||
| params='', query='', fragment='') | ||
| >>> urlparse('help/Python.html', missing_as_none=True) | ||
| ParseResult(scheme=None, netloc=None, path='help/Python.html', | ||
| params=None, query=None, fragment=None) | ||
|
|
||
| The *scheme* argument gives the default addressing scheme, to be | ||
| used only if the URL does not specify one. It should be the same type | ||
| (text or bytes) as *urlstring*, except that the default value ``''`` is | ||
| (text or bytes) as *urlstring* or ``None``, except that ``''`` is | ||
| always allowed, and is automatically converted to ``b''`` if appropriate. | ||
|
|
||
| If the *allow_fragments* argument is false, fragment identifiers are not | ||
| recognized. Instead, they are parsed as part of the path, parameters | ||
| or query component, and :attr:`fragment` is set to the empty string in | ||
| the return value. | ||
| or query component, and :attr:`fragment` is set to ``None`` or the empty | ||
| string (depending on the value of *missing_as_none*) in the return value. | ||
|
|
||
| The return value is a :term:`named tuple`, which means that its items can | ||
| be accessed by index or as named attributes, which are: | ||
|
|
||
| +------------------+-------+-------------------------+------------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+========================+ | ||
| | :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`netloc` | 1 | Network location part | empty string | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`path` | 2 | Hierarchical path | empty string | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`params` | 3 | Parameters for last | empty string | | ||
| | | | path element | | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`query` | 4 | Query component | empty string | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`fragment` | 5 | Fragment identifier | empty string | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`username` | | User name | :const:`None` | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`password` | | Password | :const:`None` | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`hostname` | | Host name (lower case) | :const:`None` | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| | :attr:`port` | | Port number as integer, | :const:`None` | | ||
| | | | if present | | | ||
| +------------------+-------+-------------------------+------------------------+ | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+===============================+ | ||
| | :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter or | | ||
| | | | | empty string [1]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`netloc` | 1 | Network location part | ``None`` or empty string [1]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`path` | 2 | Hierarchical path | empty string | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`params` | 3 | Parameters for last | ``None`` or empty string [1]_ | | ||
| | | | path element | | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`query` | 4 | Query component | ``None`` or empty string [1]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`fragment` | 5 | Fragment identifier | ``None`` or empty string [1]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`username` | | User name | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`password` | | Password | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`hostname` | | Host name (lower case) | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`port` | | Port number as integer, | ``None`` | | ||
| | | | if present | | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
|
|
||
| .. [1] Depending on the value of the *missing_as_none* argument. | ||
|
|
||
| Reading the :attr:`port` attribute will raise a :exc:`ValueError` if | ||
| an invalid port is specified in the URL. See section | ||
|
|
@@ -187,12 +203,15 @@ or on combining URL components into a URL string. | |
|
|
||
| .. versionchanged:: 3.6 | ||
| Out-of-range port numbers now raise :exc:`ValueError`, instead of | ||
| returning :const:`None`. | ||
| returning ``None``. | ||
|
|
||
| .. versionchanged:: 3.8 | ||
| Characters that affect netloc parsing under NFKC normalization will | ||
| now raise :exc:`ValueError`. | ||
|
|
||
| .. versionchanged:: next | ||
| Added the *missing_as_none* parameter. | ||
|
|
||
|
|
||
| .. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&') | ||
|
|
||
|
|
@@ -288,15 +307,27 @@ or on combining URL components into a URL string. | |
|
|
||
|
|
||
| .. function:: urlunparse(parts) | ||
| urlunparse(parts, *, keep_empty) | ||
|
|
||
| Construct a URL from a tuple as returned by ``urlparse()``. The *parts* | ||
| argument can be any six-item iterable. This may result in a slightly | ||
| different, but equivalent URL, if the URL that was parsed originally had | ||
| unnecessary delimiters (for example, a ``?`` with an empty query; the RFC | ||
| states that these are equivalent). | ||
| argument can be any six-item iterable. | ||
|
|
||
| This may result in a slightly different, but equivalent URL, if the | ||
| URL that was parsed originally had unnecessary delimiters (for example, | ||
| a ``?`` with an empty query; the RFC states that these are equivalent). | ||
|
|
||
| If *keep_empty* is true, empty strings are kept in the result (for example, | ||
| a ``?`` for an empty query); only ``None`` components are omitted. | ||
| This allows rebuilding a URL that was parsed with option | ||
| ``missing_as_none=True``. | ||
| By default, *keep_empty* is true if *parts* is the result of the | ||
|
Member
It is easy to miss this "footnote" about keep_empty not being a simple False default. I think the function signature above should make it clearer that it has a non-trivial default value when not explicitly specified. Perhaps something awkward with a fake descriptive name, but indicative that people should read further for details: keep_empty=_FALSE_UNLESS_PARTS_IS_A_URLSPLIT_RESULT or similar?
Member
Author
Pydoc will output |
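The type-dependent default discussed in this thread is commonly implemented with a private sentinel object. The following is only a minimal sketch of that pattern, not the code from this pull request: `_UNSET`, `NoneAwareParts`, `_keep_empty` and `unsplit_sketch` are hypothetical names, and the URL assembly is deliberately simplified compared with what `urlunsplit` does.

```python
from collections import namedtuple

# Hypothetical sentinel meaning "the caller did not pass keep_empty".
_UNSET = object()

_Base = namedtuple("_Base", "scheme netloc path query fragment")


class NoneAwareParts(_Base):
    """Hypothetical result type that records it was parsed with None for missing parts."""
    _keep_empty = True


def unsplit_sketch(parts, *, keep_empty=_UNSET):
    """Recombine five URL components; a simplified illustration only."""
    if keep_empty is _UNSET:
        # The default depends on the type of *parts*: plain tuples fall back to False.
        keep_empty = getattr(parts, "_keep_empty", False)
    scheme, netloc, path, query, fragment = parts

    def present(value):
        # With keep_empty, only None counts as absent; otherwise '' is absent too.
        return value is not None if keep_empty else bool(value)

    url = path or ""
    if present(netloc):
        url = "//" + netloc + url
    if present(scheme):
        url = scheme + ":" + url
    if present(query):
        url += "?" + query
    if present(fragment):
        url += "#" + fragment
    return url


plain = ("http", "example.com", "/doc", "", None)
aware = NoneAwareParts(*plain)
print(unsplit_sketch(plain))                    # empty query dropped
print(unsplit_sketch(aware))                    # empty query kept as a bare '?'
print(unsplit_sketch(aware, keep_empty=False))  # explicit argument overrides the type
```

A sentinel keeps "not passed" distinct from an explicit `False`, which is exactly the distinction the reviewer wants the signature to advertise.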
||
| :func:`urlparse` call with ``missing_as_none=True``. | ||
|
|
||
| .. function:: urlsplit(urlstring, scheme='', allow_fragments=True) | ||
| .. versionchanged:: next | ||
| Added the *keep_empty* parameter. | ||
|
|
||
|
|
||
| .. function:: urlsplit(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False) | ||
|
|
||
| This is similar to :func:`urlparse`, but does not split the params from the URL. | ||
| This should generally be used instead of :func:`urlparse` if the more recent URL | ||
|
|
@@ -310,28 +341,31 @@ or on combining URL components into a URL string. | |
| The return value is a :term:`named tuple`, its items can be accessed by index | ||
| or as named attributes: | ||
|
|
||
| +------------------+-------+-------------------------+----------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+======================+ | ||
| | :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`netloc` | 1 | Network location part | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`path` | 2 | Hierarchical path | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`query` | 3 | Query component | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`fragment` | 4 | Fragment identifier | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`username` | | User name | :const:`None` | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`password` | | Password | :const:`None` | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`hostname` | | Host name (lower case) | :const:`None` | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`port` | | Port number as integer, | :const:`None` | | ||
| | | | if present | | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+===============================+ | ||
| | :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter or | | ||
| | | | | empty string [1]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`netloc` | 1 | Network location part | ``None`` or empty string [2]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`path` | 2 | Hierarchical path | empty string | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`query` | 3 | Query component | ``None`` or empty string [2]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`fragment` | 4 | Fragment identifier | ``None`` or empty string [2]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`username` | | User name | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`password` | | Password | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`hostname` | | Host name (lower case) | ``None`` | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`port` | | Port number as integer, | ``None`` | | ||
| | | | if present | | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
|
|
||
| .. [2] Depending on the value of the *missing_as_none* argument. | ||
|
|
||
| Reading the :attr:`port` attribute will raise a :exc:`ValueError` if | ||
| an invalid port is specified in the URL. See section | ||
|
|
@@ -356,7 +390,7 @@ or on combining URL components into a URL string. | |
|
|
||
| .. versionchanged:: 3.6 | ||
| Out-of-range port numbers now raise :exc:`ValueError`, instead of | ||
| returning :const:`None`. | ||
| returning ``None``. | ||
|
|
||
| .. versionchanged:: 3.8 | ||
| Characters that affect netloc parsing under NFKC normalization will | ||
|
|
@@ -368,15 +402,31 @@ or on combining URL components into a URL string. | |
| .. versionchanged:: 3.12 | ||
| Leading WHATWG C0 control and space characters are stripped from the URL. | ||
|
|
||
| .. versionchanged:: next | ||
| Added the *missing_as_none* parameter. | ||
|
|
||
| .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser | ||
|
|
||
| .. function:: urlunsplit(parts) | ||
| urlunsplit(parts, *, keep_empty) | ||
|
|
||
| Combine the elements of a tuple as returned by :func:`urlsplit` into a | ||
| complete URL as a string. The *parts* argument can be any five-item | ||
| iterable. This may result in a slightly different, but equivalent URL, if the | ||
| URL that was parsed originally had unnecessary delimiters (for example, a ? | ||
| with an empty query; the RFC states that these are equivalent). | ||
| iterable. | ||
|
|
||
| This may result in a slightly different, but equivalent URL, if the | ||
| URL that was parsed originally had unnecessary delimiters (for example, | ||
| a ``?`` with an empty query; the RFC states that these are equivalent). | ||
|
|
||
| If *keep_empty* is true, empty strings are kept in the result (for example, | ||
| a ``?`` for an empty query); only ``None`` components are omitted. | ||
| This allows rebuilding a URL that was parsed with option | ||
| ``missing_as_none=True``. | ||
| By default, *keep_empty* is true if *parts* is the result of the | ||
|
Member
Same comment as above: make it obvious from the function signature that the default depends on the type of parts. |
||
| :func:`urlsplit` call with ``missing_as_none=True``. | ||
|
|
||
| .. versionchanged:: next | ||
| Added the *keep_empty* parameter. | ||
|
|
||
|
|
||
| .. function:: urljoin(base, url, allow_fragments=True) | ||
|
|
@@ -422,30 +472,35 @@ or on combining URL components into a URL string. | |
| Behavior updated to match the semantics defined in :rfc:`3986`. | ||
|
|
||
|
|
||
| .. function:: urldefrag(url) | ||
| .. function:: urldefrag(url, *, missing_as_none=False) | ||
|
|
||
| If *url* contains a fragment identifier, return a modified version of *url* | ||
| with no fragment identifier, and the fragment identifier as a separate | ||
| string. If there is no fragment identifier in *url*, return *url* unmodified | ||
| and an empty string. | ||
| and an empty string (by default) or ``None`` if *missing_as_none* is true. | ||
|
|
||
| The return value is a :term:`named tuple`, its items can be accessed by index | ||
| or as named attributes: | ||
|
|
||
| +------------------+-------+-------------------------+----------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+======================+ | ||
| | :attr:`url` | 0 | URL with no fragment | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| | :attr:`fragment` | 1 | Fragment identifier | empty string | | ||
| +------------------+-------+-------------------------+----------------------+ | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | Attribute | Index | Value | Value if not present | | ||
| +==================+=======+=========================+===============================+ | ||
| | :attr:`url` | 0 | URL with no fragment | empty string | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
| | :attr:`fragment` | 1 | Fragment identifier | ``None`` or empty string [3]_ | | ||
| +------------------+-------+-------------------------+-------------------------------+ | ||
|
|
||
| .. [3] Depending on the value of the *missing_as_none* argument. | ||
|
|
||
| See section :ref:`urlparse-result-object` for more information on the result | ||
| object. | ||
|
|
||
| .. versionchanged:: 3.2 | ||
| Result is a structured object rather than a simple 2-tuple. | ||
|
|
||
| .. versionchanged:: next | ||
| Added the *missing_as_none* parameter. | ||
|
|
||
| .. function:: unwrap(url) | ||
|
|
||
| Extract the url from a wrapped URL (that is, a string formatted as | ||
|
|
@@ -465,8 +520,9 @@ URLs elsewhere. Their purpose is for practical functionality rather than | |
| purity. | ||
|
|
||
| Instead of raising an exception on unusual input, they may instead return some | ||
| component parts as empty strings. Or components may contain more than perhaps | ||
| they should. | ||
| component parts as empty strings or ``None`` (depending on the value of the | ||
| *missing_as_none* argument). | ||
| Or components may contain more than perhaps they should. | ||
|
|
||
| We recommend that users of these APIs where the values may be used anywhere | ||
| with security implications code defensively. Do some verification within your | ||
|
|
@@ -542,7 +598,8 @@ previous section, as well as an additional method: | |
| Return the re-combined version of the original URL as a string. This may | ||
| differ from the original URL in that the scheme may be normalized to lower | ||
| case and empty components may be dropped. Specifically, empty parameters, | ||
| queries, and fragment identifiers will be removed. | ||
| queries, and fragment identifiers will be removed unless the URL was parsed | ||
| with ``missing_as_none=True``. | ||
|
|
||
| For :func:`urldefrag` results, only empty fragment identifiers will be removed. | ||
| For :func:`urlsplit` and :func:`urlparse` results, all noted changes will be | ||
|
|
@@ -559,6 +616,9 @@ previous section, as well as an additional method: | |
| >>> r2 = urlsplit(r1.geturl()) | ||
| >>> r2.geturl() | ||
| 'http://www.Python.org/doc/' | ||
| >>> r3 = urlsplit(url, missing_as_none=True) | ||
| >>> r3.geturl() | ||
| 'http://www.Python.org/doc/#' | ||
|
|
||
|
|
||
| The following classes provide the implementations of the structured parse | ||
|
|
||
I am really sorry that I am contributing so late, but "missing_as_none=False" is confusing and not intuitive at all to me.
Pretty sure others who have not participated are going to feel the same.
The function signature and the term do not signal what the parameter is meant to do.
Are you open to a new name?

I am adding @gpshead, as one of the active developers in this area, to get his opinion too.

I'm fine with this name. Admittedly I have spent too much time in the past wrangling problems in this library, but the reason it still works for me despite that is that it is a common concept: do you represent the absence of a value distinctly from the base zero/empty version of that type or not? That is what None is for, and missing_as_none is at least explicit in name to indicate that some values may be None. I'm not going to call it pretty, but it is "understandable enough" for me. I can't come up with anything that'd be meaningfully better rather than just alternately-understandable.

What are your thoughts about missing_as_empty with the opposite semantics? In the future, None will be returned for undefined components by default, and you will need to specify missing_as_none=False or missing_as_empty=True to restore the current behavior.

I think that name should really be missing_as_empty_string at least (we can't say just "empty"), meaning missing parts are returned as empty strings.
What about use_none=False, which is short and doesn't try to be self-explanatory, so people need to read the docs?
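For readers following the naming discussion, the distinction at stake (a component that is absent versus one that is present but empty) can be illustrated without the new parameter. The sketch below uses the regular expression given in RFC 3986, Appendix B, where an unmatched group naturally comes back as `None`. `Parts` and `split_keeping_none` are hypothetical names for illustration; this does not reproduce the validation or normalization that `urlsplit` performs.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 hold
# the scheme, authority, path, query and fragment respectively.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


class Parts(NamedTuple):
    """Hypothetical result type, not part of urllib.parse."""
    scheme: Optional[str]
    netloc: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_keeping_none(url: str) -> Parts:
    """Split *url*, returning None for components whose delimiter is absent."""
    # Every group in the pattern is optional, so the match always succeeds.
    match = _URI_RE.match(url)
    return Parts(match.group(2), match.group(4), match.group(5),
                 match.group(7), match.group(9))


print(split_keeping_none("http://docs.python.org?"))  # query is '' (delimiter present)
print(split_keeping_none("http://docs.python.org"))   # query is None (delimiter absent)
print(split_keeping_none("help/Python.html"))         # only the path is defined
```

The path is always a string because the grammar allows it to be empty but never absent, which matches the "empty string" entry for `path` in the tables above.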