-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Use HTML meta charset attribute value, if present, when the Context-Type header does not specify it. #4338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need doc change as you're adding to public api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTML5 doesn't require quotes around attribute values and single quotes are also legal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
between meta and charset one or more spaces are allowed not only one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a regular expression might be better here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about platforms where these encodings don't exist (linux/mac), which will cause the tests to fail. Do additional steps need to be taken to register new encodings, or does that happen elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
Moving to regex for detecting charset. Will reopen when done. |
…ype header does not specify it.
Add negative test to validate handling of invalid/unsupported encodings in StreamHelper.DecodeStream
|
@TravisEz13 and @JamesWTruher, please take a look again. I've completed the changes to use regex. |
| if (isDefaultEncoding) do | ||
| { | ||
| // check for a charset attribute on the meta element to override the default. | ||
| Match match = s_metaexp.Match(content); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to cache the regex, you need to use the static regex methods. See: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex(v=vs.110).aspx#static_vs_instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TravisEz13 - According to my reading of the page, I'm doing exactly what it says...
"To prevent recompilation, you should instantiate a single Regex object that is accessible to all code that requires it, as shown in the following rewritten example."
Since s_metaexp is static/readonly, there is only one compilation. That appears to be the point of the of the caching; avoiding recompilation. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under the Important note:
In the .NET Framework versions 1.0 and 1.1, all compiled regular expressions, whether they were used in instance or static method calls, were cached. Starting with the .NET Framework 2.0, only regular expressions used in static method calls are cached.`
Basically, they changed the behavior and only added a not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My review of the corefx shows that constructing and reusing an instance (regex.Match) is equivalent to calling the static Match method. The constructor manages cache lookups/updates. The static Match method constructs a new instance and calls through the instance Match method.
The relevant static method is:
public static Match Match(string input, string pattern, RegexOptions options, TimeSpan matchTimeout)
{
return new Regex(pattern, options, matchTimeout, true).Match(input);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sound good, thanks for looking into it.
| #region charset encoding tests | ||
|
|
||
| Context "BasicHtmlWebResponseObject Encoding tests" { | ||
| It "Verifies Invoke-WebRequest detects charset meta value when the ContentType header does not define it." { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to add test cases for any of the issues that caused the change to regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did add test cases covering variations of <meta charset...?> and , including various newline variations and unquoted attribute value.
|
@TravisEz13: would you merge this. Thanks. |
|
@chuanjiao10 since this is already merged, @dantraMSFT will submit new PR to address your issue |
|
@chuanjiao10, see #4692 |
Fix #3267
The change enables checking for the charset meta attribute when the Content-Type header does not define the charset. When found, the text content is re-encoded to the target charset.
A new property has been added to BasicHtmlWebResponseObject and HtmlWebResponseObject to support verifying the encoding. Attempting to do so from the returned content is non-deterministic.
See MicrosoftDocs/PowerShell-Docs#1543 for example usage.
NOTE: Three tests are pending the resolution for issue #2867