Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,27 @@ namespace Microsoft.PowerShell.Commands
/// <summary>
/// Response object for html content without DOM parsing.
/// </summary>
public class BasicHtmlWebResponseObject : WebResponseObject
public partial class BasicHtmlWebResponseObject : WebResponseObject
{
#region Private Fields

private static Regex s_attribNameValueRegex;
private static Regex s_attribsRegex;
private static Regex s_imageRegex;
private static Regex s_inputFieldRegex;
private static Regex s_linkRegex;
private static Regex s_tagRegex;
[GeneratedRegex(@"([^""'>/=\s\p{Cc}]+)(?:\s*=\s*(?:""(.*?)""|'(.*?)'|([^'"">\s]+)))?", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_attribNameValueRegex();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the point of this change. The original Regex is compiled and works just as fast, but it is more compact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to this change we can remove EnsureHtmlParser();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what are the benefits? The final code is still noticeably larger, but the performance does not increase.

Copy link
Contributor Author

@CarloToso CarloToso Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will revert the regex stuff, move the remaining changes to the other webcmdlet cleanup PR and close this PR

Copy link

@PaulusParssinen PaulusParssinen Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's significant difference between RegexOptions.Compiled and GeneratedRegex. Regex Source Generator will not only reduce startup time but also increase performance.
I can't simply put it into words how much better generated regex is, thankfully Stephen Toub has great post about them here https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#source-generation (In fact, .NET team wants people to switch to GeneratedRegex so bad that they want to intercept every new Regex(..) call and delegate them to source generated one instead)

Stephen also talks about them here https://www.youtube.com/live/rwfNDyBBgks?t=1549
Nick Chapsas has also a quick summary of them https://www.youtube.com/watch?v=WosEhlHATOk

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't simply put it into words how much better generated regex is,

When it comes to code performance neither words nor emotions work - you have to prove with tests that a particular code will become smaller and faster.

In this particular case, the code will only get bigger because PowerShell is dynamic and can never completely get rid of the Regex compiler. The initial improvements won't be noticeable since the initialization of the cmdlet takes a significant amount of time.

Where we would like Regex SG and where it is justified is the pwsh startup scenario. And there are indeed places where this SG can be applied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iSazonov if you pointed me to such cases I could try to apply Regex SG

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it comes to code performance neither words nor emotions work - you have to prove with tests that a particular code will become smaller and faster.

You're correct here. I should have provided data for my claims.

In this particular case, the code will only get bigger because PowerShell is dynamic and can never completely get rid of the Regex compiler. The initial improvements won't be noticeable since the initialization of the cmdlet takes a significant amount of time.

Where we would like Regex SG and where it is justified is the pwsh startup scenario. And there are indeed places where this SG can be applied.

I didn't realize this code wasn't used in the pwsh itself, I admit my mistake. In this case delaying parsing and doing all the complex IL emitting is justified, if this code is not called in a cmdlet.

This also makes me want to compare the size of the IL emitted for RegexOptions.Compiled and the IL compiled by Roslyn for the source generated one.. interesting.


[GeneratedRegex(@"(?<=\s+)([^""'>/=\s\p{Cc}]+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_attribsRegex();

[GeneratedRegex(@"<img\s[^>]*?>", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_imageRegex();

[GeneratedRegex(@"<input\s+[^>]*(/?>|>.*?</input>)", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_inputFieldRegex();

[GeneratedRegex(@"<a\s+[^>]*(/>|>.*?</a>)", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_linkRegex();

[GeneratedRegex(@"<\w+((\s+[^""'>/=\s\p{Cc}]+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline | RegexOptions.IgnoreCase)]
private static partial Regex s_tagRegex();

#endregion Private Fields

Expand All @@ -43,7 +54,6 @@ public BasicHtmlWebResponseObject(HttpResponseMessage response) : this(response,
/// <param name="contentStream"></param>
public BasicHtmlWebResponseObject(HttpResponseMessage response, Stream contentStream) : base(response, contentStream)
{
EnsureHtmlParser();
InitializeContent();
InitializeRawContent(response);
}
Expand Down Expand Up @@ -82,10 +92,8 @@ public WebCmdletElementCollection InputFields
{
if (_inputFields == null)
{
EnsureHtmlParser();

List<PSObject> parsedFields = new();
MatchCollection fieldMatch = s_inputFieldRegex.Matches(Content);
MatchCollection fieldMatch = s_inputFieldRegex().Matches(Content);
foreach (Match field in fieldMatch)
{
parsedFields.Add(CreateHtmlObject(field.Value, "INPUT"));
Expand All @@ -109,10 +117,8 @@ public WebCmdletElementCollection Links
{
if (_links == null)
{
EnsureHtmlParser();

List<PSObject> parsedLinks = new();
MatchCollection linkMatch = s_linkRegex.Matches(Content);
MatchCollection linkMatch = s_linkRegex().Matches(Content);
foreach (Match link in linkMatch)
{
parsedLinks.Add(CreateHtmlObject(link.Value, "A"));
Expand All @@ -136,10 +142,8 @@ public WebCmdletElementCollection Images
{
if (_images == null)
{
EnsureHtmlParser();

List<PSObject> parsedImages = new();
MatchCollection imageMatch = s_imageRegex.Matches(Content);
MatchCollection imageMatch = s_imageRegex().Matches(Content);
foreach (Match image in imageMatch)
{
parsedImages.Add(CreateHtmlObject(image.Value, "IMG"));
Expand Down Expand Up @@ -167,11 +171,6 @@ protected void InitializeContent()
// Fill the Content buffer
string characterSet = WebResponseHelper.GetCharacterSet(BaseResponse);

if (string.IsNullOrEmpty(characterSet) && ContentHelper.IsJson(contentType))
{
characterSet = Encoding.UTF8.HeaderName;
}

Content = StreamHelper.DecodeStream(RawContentStream, characterSet, out Encoding encoding);
Encoding = encoding;
}
Expand All @@ -193,32 +192,11 @@ private static PSObject CreateHtmlObject(string html, string tagName)
return elementObject;
}

private static void EnsureHtmlParser()
{
s_tagRegex ??= new Regex(@"<\w+((\s+[^""'>/=\s\p{Cc}]+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

s_attribsRegex ??= new Regex(@"(?<=\s+)([^""'>/=\s\p{Cc}]+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

s_attribNameValueRegex ??= new Regex(@"([^""'>/=\s\p{Cc}]+)(?:\s*=\s*(?:""(.*?)""|'(.*?)'|([^'"">\s]+)))?",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

s_inputFieldRegex ??= new Regex(@"<input\s+[^>]*(/?>|>.*?</input>)",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

s_linkRegex ??= new Regex(@"<a\s+[^>]*(/>|>.*?</a>)",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

s_imageRegex ??= new Regex(@"<img\s[^>]*?>",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);
}

private void InitializeRawContent(HttpResponseMessage baseResponse)
{
StringBuilder raw = ContentHelper.GetRawContentHeader(baseResponse);
raw.Append(Content);
this.RawContent = raw.ToString();
RawContent = raw.ToString();
}

private static void ParseAttributes(string outerHtml, PSObject elementObject)
Expand All @@ -228,16 +206,16 @@ private static void ParseAttributes(string outerHtml, PSObject elementObject)
{
// Extract just the opening tag of the HTML element (omitting the closing tag and any contents,
// including contained HTML elements)
var match = s_tagRegex.Match(outerHtml);
Match match = s_tagRegex().Match(outerHtml);

// Extract all the attribute specifications within the HTML element opening tag
var attribMatches = s_attribsRegex.Matches(match.Value);
MatchCollection attribMatches = s_attribsRegex().Matches(match.Value);

foreach (Match attribMatch in attribMatches)
{
// Extract the name and value for this attribute (allowing for variations like single/double/no
// quotes, and no value at all)
var nvMatches = s_attribNameValueRegex.Match(attribMatch.Value);
Match nvMatches = s_attribNameValueRegex().Match(attribMatch.Value);
Debug.Assert(nvMatches.Groups.Count == 5);

// Name is always captured by group #1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,7 @@ private void Initialize()
}
}

internal static class StreamHelper
internal static partial class StreamHelper
{
#region Constants

Expand Down Expand Up @@ -383,14 +383,7 @@ private static string StreamToString(Stream stream, Encoding encoding)

internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding)
{
try
{
encoding = Encoding.GetEncoding(characterSet);
}
catch (ArgumentException)
{
encoding = null;
}
TryGetEncoding(characterSet, out encoding);

return DecodeStream(stream, ref encoding);
}
Expand All @@ -411,15 +404,11 @@ internal static bool TryGetEncoding(string characterSet, out Encoding encoding)
return result;
}

private static readonly Regex s_metaRegex = new(
@"<meta\s.*[^.><]*charset\s*=\s*[""'\n]?(?<charset>[A-Za-z].[^\s""'\n<>]*)[\s""'\n>]",
RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.NonBacktracking
);
[GeneratedRegex(@"<meta\s.*[^.><]*charset\s*=\s*[""'\n]?(?<charset>[A-Za-z].[^\s""'\n<>]*)[\s""'\n>]", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.NonBacktracking)]
private static partial Regex s_metaRegex();

private static readonly Regex s_xmlRegex = new(
@"<\?xml\s.*[^.><]*encoding\s*=\s*[""'\n]?(?<charset>[A-Za-z].[^\s""'\n<>]*)[\s""'\n>]",
RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.NonBacktracking
);
[GeneratedRegex(@"<\?xml\s.*[^.><]*encoding\s*=\s*[""'\n]?(?<charset>[A-Za-z].[^\s""'\n<>]*)[\s""'\n>]", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.NonBacktracking)]
private static partial Regex s_xmlRegex();

internal static string DecodeStream(Stream stream, ref Encoding encoding)
{
Expand All @@ -439,12 +428,12 @@ internal static string DecodeStream(Stream stream, ref Encoding encoding)
string substring = content.Substring(0, Math.Min(content.Length, 1024));

// Check for a charset attribute on the meta element to override the default
Match match = s_metaRegex.Match(substring);
Match match = s_metaRegex().Match(substring);

// Check for a encoding attribute on the xml declaration to override the default
if (!match.Success)
{
match = s_xmlRegex.Match(substring);
match = s_xmlRegex().Match(substring);
}

if (match.Success)
Expand Down