Skip to content

Conversation

@dantraMSFT
Copy link
Contributor

@dantraMSFT dantraMSFT commented Jul 25, 2017

Fix #3267

The change enables checking for the charset meta attribute when the Content-Type header does not define the charset. When found, the text content is re-encoded to the target charset.

A new property has been added to BasicHtmlWebResponseObject and HtmlWebResponseObject to support verifying the encoding. Attempting to do so from the returned content is non-deterministic.
See MicrosoftDocs/PowerShell-Docs#1543 for example usage.

NOTE: Three tests are pending the resolution for issue #2867

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need doc change as you're adding to public api

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTML5 doesn't require quotes around attribute values and single quotes are also legal

@TravisEz13 TravisEz13 added the Documentation Needed in this repo Documentation is needed in this repo label Aug 3, 2017
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

between meta and charset one or more spaces are allowed not only one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a regular expression might be better here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about platforms where these encodings don't exist (linux/mac), which will cause the tests to fail. Do additional steps need to be taken to register new encodings, or does that happen elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@dantraMSFT
Copy link
Contributor Author

Moving to regex for detecting charset. Will reopen when done.

dantraMSFT and others added 6 commits August 9, 2017 13:00
Add negative test to validate handling of invalid/unsupported encodings in StreamHelper.DecodeStream
Revise regex expression to be less restrictive of the charset attribute value and use a named group for the value.
Updated regex to handle http-equiv charset declarations
Update tests to validate http-equiv charset declarations
Strip trailing whitespace.
@dantraMSFT
Copy link
Contributor Author

@TravisEz13 and @JamesWTruher, please take a look again. I've completed the changes to use regex.

if (isDefaultEncoding) do
{
// check for a charset attribute on the meta element to override the default.
Match match = s_metaexp.Match(content);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to cache the regex, you need to use the static regex methods. See: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex(v=vs.110).aspx#static_vs_instance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TravisEz13 - According to my reading of the page, I'm doing exactly what it says...

"To prevent recompilation, you should instantiate a single Regex object that is accessible to all code that requires it, as shown in the following rewritten example."

Since s_metaexp is static/readonly, there is only one compilation. That appears to be the point of the of the caching; avoiding recompilation. Am I missing something?

Copy link
Member

@TravisEz13 TravisEz13 Aug 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the Important note:

In the .NET Framework versions 1.0 and 1.1, all compiled regular expressions, whether they were used in instance or static method calls, were cached. Starting with the .NET Framework 2.0, only regular expressions used in static method calls are cached.`

Basically, they changed the behavior and only added a not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review of the corefx shows that constructing and reusing an instance (regex.Match) is equivalent to calling the static Match method. The constructor manages cache lookups/updates. The static Match method constructs a new instance and calls through the instance Match method.
The relevant static method is:
public static Match Match(string input, string pattern, RegexOptions options, TimeSpan matchTimeout)
{
return new Regex(pattern, options, matchTimeout, true).Match(input);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound good, thanks for looking into it.

#region charset encoding tests

Context "BasicHtmlWebResponseObject Encoding tests" {
It "Verifies Invoke-WebRequest detects charset meta value when the ContentType header does not define it." {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to add test cases for any of the issues that caused the change to regex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add test cases covering variations of <meta charset...?> and , including various newline variations and unquoted attribute value.

@dantraMSFT
Copy link
Contributor Author

@TravisEz13: would you merge this. Thanks.

@TravisEz13 TravisEz13 merged commit 75a2949 into PowerShell:master Aug 16, 2017
@dantraMSFT dantraMSFT deleted the dantra/issue3267 branch August 16, 2017 23:27
@SteveL-MSFT
Copy link
Member

@chuanjiao10 since this is already merged, @dantraMSFT will submit new PR to address your issue

@dantraMSFT
Copy link
Contributor Author

@chuanjiao10, see #4692

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants