Skip to content

Conversation

@iSazonov
Copy link
Collaborator

@iSazonov iSazonov commented Nov 22, 2020

PR Summary

The PR does an optimization of IEnumerable variant of replace operator:
$ABigList -replace "search pattern", <substitution>.

Before the change we evaluate the substitution for every element from $ABigList that can be very expensive if the list is large.
If the substitution is a string we do no extra operations.
But if the substitution is:

  • a custom type with defined custom type converter from the custom type to MatchEvaluator class
  • falling in last resort conversion to string with PSObject.ToStringParser() (see ReplaceOperatorImpl() method)

evaluating the substitution for every element becomes expensive.

PSObject.ToStringParser() is PowerShell magic and it is important to have it fast.
Conversion to MatchEvaluator looks like an edge case but it is before PSObject.ToStringParser() so we should optimize the case too.

The PR does simple optimization - after first evaluation the substitution is stored in a variable and reused for follow elements of the list.

Follow demo script takes 5 seconds before the change and 1 second after:

class A {
    [string] ToString() { Sleep 1; return "A" }
}

$a = [A]::new()
"B1","B2","B3","B4","B5" -replace "B", $a

PR Context

Follow #14128 and #14087.

PR Checklist

@iSazonov iSazonov added Documentation Needed in this repo Documentation is needed in this repo CL-Performance Indicates that a PR should be marked as a performance improvement in the Change Log labels Nov 22, 2020
@iSazonov iSazonov requested a review from daxian-dbw November 22, 2020 18:46
@ghost ghost added the Review - Needed The PR is being reviewed label Nov 30, 2020
@ghost
Copy link

ghost commented Nov 30, 2020

This pull request has been automatically marked as Review Needed because it has been there has not been any activity for 7 days.
Maintainer, please provide feedback and/or mark it as Waiting on Author

string replacement = PSObject.ToStringParser(context, substitute);
return regex.Replace(input, replacement);
cachedReplacementString = PSObject.ToStringParser(context, substitute);
return regex.Replace(input, cachedReplacementString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if we can reduce the return statements. How about the following?

            if (cachedReplacementString is null && cachedMatchEvaluator is null)
            {
                switch (substitute)
                {
                    case string replacement:
                        cachedReplacementString = replacement;
                        break;

                    case ScriptBlock sb:
                        cachedMatchEvaluator = GetMatchEvaluator(context, sb);
                        break;

                    case object val when LanguagePrimitives.TryConvertTo(val, out cachedMatchEvaluator):
                        break;

                    default:
                        cachedReplacementString = PSObject.ToStringParser(context, substitute);
                        break;
                }
            }

            if (cachedReplacementString is not null)
            {
                return regex.Replace(input, cachedReplacementString);
            }
            else
            {
                return regex.Replace(input, cachedMatchEvaluator);
            }

Copy link
Collaborator Author

@iSazonov iSazonov Dec 3, 2020

Copy link
Collaborator

@powercode powercode Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

            if (cachedReplacementString is null && cachedMatchEvaluator is null)
            {
                (cachedReplacementString, cachedMatchEvaluator) = substitute switch
                {
                    string replacement => (replacement, (MatchEvaluator) null),
                    ScriptBlock sb => (null, GetMatchEvaluator(context, sb)),
                    { } val when LanguagePrimitives.TryConvertTo(val, out MatchEvaluator me) => (null, me),
                    _ => (PSObject.ToStringParser(context, substitute), null),
                };
            }
            // After first call we will have necessarily either cachedReplacementString or cachedMatchEvaluator.
            return cachedReplacementString is { } replacementString
                ? regex.Replace(input, replacementString)
                : regex.Replace(input, cachedMatchEvaluator);

@rjmholt rjmholt changed the title Opimize IEnumerable variant of replace operator Optimize IEnumerable variant of replace operator Dec 3, 2020
Copy link
Collaborator

@rjmholt rjmholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't seem to add any tests... Do we know that this code is tested?

Comment on lines 1018 to 1027
// After first call we will have necessarily either cachedReplacementString or cachedMatchEvaluator.
if (cachedReplacementString is not null)
{
return regex.Replace(input, cachedReplacementString);
}

if (cachedMatchEvaluator is not null)
{
return regex.Replace(input, cachedMatchEvaluator);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the semantics here.

Let's say we have an IEnumerable of two strings. On the first string we hit line 1031, and cache that string replacement.

On the second string, it looks like we have a cached value from the first call, so we ignore the second string and immediately use the first again?

How does that work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacement string is not input string - I guess you lost this.

Copy link
Collaborator

@rjmholt rjmholt Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Ok, so why on earth did the original implementation (i.e. before your change) do it like this? It's pretty easy to convert the substitute to a known result ahead of time without complicated caching logic. Once again, we've very seriously forgotten how to use object-oriented programming.

I think we need to:

  • Create the substitute object once, above the loop
  • Then pass it into the ReplaceOperatorImpl method

That's much much simpler than anything else.

private abstract class RegexReplacer
{
    protected RegexReplacer(Regex regex)
    {
        ReplacementRegex = regex;
    }

    public abstract string Replace(string input);

    protected Regex ReplacementRegex { get; }
}

private class StringReplacer : RegexReplacer
{
    private readonly string _replacementString;

    public StringReplacer(Regex regex, string replacement)
        : base(regex)
    {
        _replacementString = str;
    }

    public override string Replace(string input)
    {
        return ReplacementRegex.Replace(input, _replacementString);
    }
}

private class MatchEvaluatorReplacer : RegexReplacer
{
    private readonly MatchEvaluator _replacementMatchEvaluator;

    public MatchEvaluatorReplacer(Regex regex, MatchEvaluator replacement)
        : base(regex)
    {
        _replacementMatchEvaluator = replacement;
    }

    public override string Replace(string input)
    {
        return ReplacementRegex.Replace(input, _replacementMatchEvaluator);
    }
}

...

private static RegexReplacer GetSubstituteReplacer(
    ExecutionContext context,
    Regex regex,
    object substitute)
{
    switch (substitute)
    {
        // When the substitution is a string, we can skip all the conversion logic
        case string replacement:
            return new StringReplacer(regex, replacement);

        // ScriptBlocks implement custom regex evaluators, so we must handle them specially
        case ScriptBlock sb:
            return new MatchEvaluatorReplacer(regex, GetMatchEvaluator(context, sb));
    }

    // Try once more to convert the substitute to a match evaluator
    if (LanguagePrimitives.TryConvertTo(subtitute, out MatchEvaluator matchEvaluator))
    {
        return new MatchEvaluatorReplacer(regex, matchEvaluator);
    }

    // Finally, we just treat the substitute as a string
    return new StringReplacer(regex, PSObject.ToStringParser(context, substitute));
}

...

// <main method where we're performing the logic>
{
    RegexReplacer replacer = GetSubstituteReplacer(context, rr, substitute);
    if (list == null)
    {
        ...
        return replacer.Replace(lvalString);
    }
    else
    {
        List<object> resultList = new List<object>();
        while (ParserOps.MoveNext(context, errorPosition, list))
        {
            string lvalString = PSObject.ToStringParser(context, ParserOps.Current(errorPosition, list));
            resultList.Add(replacer.Replace(lvalString));
        }
    }
}

With this:

  • There's no ReplaceOperatorImpl method at all (which I especially like because Impl is a smell)
  • We only look at the substitute once, before looking at the inputs, and execute that logic once (no more reflection-heavy switch inside the loop)
  • No more extraneous cached values to juggle and pass around. We only allocate the RegexReplacer object we need.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think creating classes is redundant in the context so I combined the proposal with your previous based on struct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not redundant. The class encapsulates a pre-computed result. We shouldn't be revisiting that switch statement inside the loop.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I can live with your most recent proposal

@iSazonov
Copy link
Collaborator Author

iSazonov commented Dec 3, 2020

This PR doesn't seem to add any tests... Do we know that this code is tested?

I added one test.

@ghost ghost removed the Review - Needed The PR is being reviewed label Dec 3, 2020
@ghost ghost added the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 4, 2020
@ghost ghost removed the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 4, 2020
@ghost ghost added the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 8, 2020
Co-authored-by: Dongbo Wang <dongbow@microsoft.com>
@ghost ghost removed the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 9, 2020
@rjmholt rjmholt requested a review from daxian-dbw December 9, 2020 06:09
Copy link
Member

@daxian-dbw daxian-dbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/// <param name="substitute">The substitute value.</param>
/// <returns>The result of the regex.Replace operation.</returns>
private static object ReplaceOperatorImpl(ExecutionContext context, string input, Regex regex, object substitute)
private struct ReplaceOperator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out ReplaceOperator is already the name of a static method. I guess let's just call it ReplaceOperatorImpl then :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-))) Sorry, I'm already confused about the names.

Fixed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never liked Impl because it widens the "least representational gap", but I totally acknowledge it's fine in this case. Just want to make the case that I think we should be removing Impls from the code in general.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never liked Impl

I tried to avoid it but .... :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ReplaceOperatorProcessor?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change it! You've done great work here @iSazonov -- thanks!

@rjmholt
Copy link
Collaborator

rjmholt commented Dec 9, 2020

@PoshChan remind me in 4 hours

@rjmholt rjmholt merged commit ba49eb2 into PowerShell:master Dec 9, 2020
@rjmholt
Copy link
Collaborator

rjmholt commented Dec 9, 2020

@iSazonov can you describe the documentation we need for this? I want to follow up on the label you added.

@iSazonov
Copy link
Collaborator Author

iSazonov commented Dec 9, 2020

@rjmholt It seems we haven't docs for the replace operator kind. I will open new issue in docs repo.

@iSazonov iSazonov deleted the perf-replcace-ienumerable branch December 9, 2020 20:03
@PoshChan
Copy link
Collaborator

PoshChan commented Dec 9, 2020

@rjmholt, this is the reminder you requested 4 hours ago

@ghost
Copy link

ghost commented Dec 15, 2020

🎉v7.2.0-preview.2 has been released which incorporates this pull request.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CL-Performance Indicates that a PR should be marked as a performance improvement in the Change Log

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants