fix: content encoding inference from meta tag, especially for various unusual format #8031

Open
edenw97 wants to merge 13 commits into mitmproxy:main from edenw97:content-encoding-meta

Conversation

edenw97 commented Dec 30, 2025

Fix the content encoding inference from the meta tag, especially for unusual formats (unquoted values, extra whitespace). This follows https://html.spec.whatwg.org/#extracting-character-encodings-from-meta-elements

Description

This PR addresses multiple edge cases that arise when we infer the content encoding from the meta tag.
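
To illustrate the kind of edge cases involved, here is a minimal sketch, not the code in this PR. The names `META_CHARSET` and `sniff_meta_charset` are hypothetical, and the regex is a simplification of the WHATWG algorithm that only covers quoted values, unquoted values, and whitespace around `=`.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not mitmproxy's actual regex):
# group 1 = double-quoted value, group 2 = single-quoted value,
# group 3 = unquoted value. Whitespace around "=" is allowed.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'/>;]+))""",
    re.IGNORECASE,
)


def sniff_meta_charset(content: bytes) -> Optional[str]:
    """Return the charset label declared in a <meta> tag, or None."""
    m = META_CHARSET.search(content)
    if m is None:
        return None
    raw = m.group(1) or m.group(2) or m.group(3) or b""
    # Strip before checking for emptiness so that charset="" and
    # charset="   " are both treated as "no usable label".
    label = raw.decode("ascii", "ignore").strip().lower()
    return label or None
```

With this sketch, `<meta charset=utf-8>`, `<meta charset = "UTF-8" >`, and `<meta charset=" utf-8 ">` all yield the same label, while `<meta charset="">` yields `None`.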

Checklist

  • I have updated tests where applicable.
  • I have added an entry to the CHANGELOG.

mhils (Member) left a comment

Thanks!

match = meta_charset.group(2) or meta_charset.group(3)
if match:
    enc = match.decode("ascii", "ignore").strip()
else:

We already validated that meta_charset is not None, so can this case ever happen?

(charset="" would be an option, but then we should move the .strip() before the if as well)

edenw97 (Author) commented Jan 5, 2026

Hi, do you mean we should move the strip() to before the decode?
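
One possible reading of the review suggestion, as a rough sketch: strip the value first and only then test it, so that both `charset=""` and a whitespace-only value take the fallback path. `PATTERN` and `charset_from_match` are hypothetical names. The group layout (group 2 = quoted value, group 3 = unquoted value) and the `None` fallback are assumptions, since the actual regex and the body of the `else` branch are not shown in this thread.

```python
import re
from typing import Optional

# Hypothetical stand-in pattern (assumption): group 1 = quote character,
# group 2 = quoted value, group 3 = unquoted value.
PATTERN = re.compile(
    rb"""charset\s*=\s*(?:(["'])(.*?)\1|([^\s"'/>;]+))""",
    re.IGNORECASE,
)


def charset_from_match(meta_charset: "re.Match[bytes]") -> Optional[str]:
    """Extract a cleaned charset label from an already-validated match."""
    raw = meta_charset.group(2) or meta_charset.group(3) or b""
    # Strip *before* the emptiness check. bytes.strip() also exists, so
    # stripping before or after decode() gives the same result here.
    enc = raw.decode("ascii", "ignore").strip()
    if enc:
        return enc
    # Assumption: the caller falls back to some default when this is None.
    return None
```

Under these assumptions, a match on `charset="  "` has a truthy group 2 (two spaces), so checking `if match:` before stripping would let an empty label through, whereas stripping first routes it to the fallback.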
