
Conversation

@daxian-dbw (Member) commented Dec 2, 2020

PR Summary

Compared with v7.1.0, the changes in this PR improve the startup of a stable-version PowerShell by about 19% on an 8-logical-core machine and by about 10% on a 4-logical-core machine.
A stable version of PowerShell means experimental features are disabled in the build by default, just like a GA release.

Major changes:

  1. Push the call to InitialSessionState.CreateDefault2() to a worker thread.
  2. Push the call to GetConfigurationNameFromGroupPolicy() to a worker thread.
    The method Utils.GetPolicySetting<T>(...) is pretty expensive at startup time, including loading system.collections.concurrent.dll, system.io.filesystem.dll, newtonsoft.json.dll, the overhead of using ConcurrentDictionary, and the overhead of reading settings from the configuration file.
  3. Push the call to SendPSCoreStartupTelemetry to a worker thread.
    This is also an expensive call at startup time, including loading microsoft.applicationinsights.dll, initializing the telemetry client, and actually writing out the telemetry.

By pushing them to worker threads, the assembly loading and type initialization all take place on worker threads, allowing the main thread to keep moving forward until it has to block for one of the tasks to finish.
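
The shape of the change is roughly as follows (an illustrative sketch only, not the actual ConsoleHost code; the stand-in helpers below mirror the method names listed above):

```csharp
using System.Threading.Tasks;
using System.Management.Automation.Runspaces;

// Hypothetical sketch: queue the expensive initializers to worker threads right
// after entry, so assembly loading and type initialization happen off the main thread.
internal static class StartupOffloadSketch
{
    internal static Task<InitialSessionState> DefaultIssTask;
    internal static Task<string> ConfigurationNameTask;
    internal static Task TelemetryTask;

    internal static void Begin()
    {
        DefaultIssTask = Task.Run(() => InitialSessionState.CreateDefault2());
        ConfigurationNameTask = Task.Run(() => GetConfigurationNameFromGroupPolicy());
        TelemetryTask = Task.Run(() => SendPSCoreStartupTelemetry());
    }

    // Stand-ins for the internal helpers named in the PR summary.
    private static string GetConfigurationNameFromGroupPolicy() => null;
    private static void SendPSCoreStartupTelemetry() { }
}
```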

Implementation-wise,

  1. ConsoleHost.DefaultInitialSessionState is changed so that its getter either returns a previously set InitialSessionState value, or waits for the task result and returns it (see the sketch after this list).
  2. Similarly, CommandLineParameterParser.ConfigurationName is changed to have its getter wait for the 'GetConfigurationNameFromGroupPolicy' task to finish.
  3. Telemetry doesn't block anything, but we need to make sure the startup telemetry gets sent before the process exits. (We ignore the cases where the process is forcefully terminated, in which case we cannot guarantee the telemetry is sent anyway.)
  4. Remove CreateRunspace(object runspaceCreationArgs) and RunspaceCreationEventArgs. They add an intermediate layer of method calls that is not needed at all.
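
A rough sketch of the getter pattern from items 1 and 2 (field names here are illustrative; the real property has more state handling):

```csharp
using System.Threading.Tasks;
using System.Management.Automation.Runspaces;

internal sealed class ConsoleHostSketch
{
    private InitialSessionState _defaultIss;
    private Task<InitialSessionState> _defaultIssTask;

    // Hypothetical sketch of the DefaultInitialSessionState change: return a previously
    // set value if there is one, otherwise block on the worker-thread task the first
    // time the value is actually needed.
    internal InitialSessionState DefaultInitialSessionState
    {
        get
        {
            if (_defaultIss == null && _defaultIssTask != null)
            {
                _defaultIss = _defaultIssTask.GetAwaiter().GetResult();
                _defaultIssTask = null;
            }

            return _defaultIss;
        }

        set => _defaultIss = value;
    }
}
```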

Startup time comparison

The following results are from 3 consecutive runs of the benchmark program, which compares v7.1.0, v7.2-wip (this PR), and Windows PowerShell 5.1.

Notes:

  1. Since we are comparing with the stable version v7.1.0, v7.2-wip is built using Start-PSBuild -Clean -ReleaseTag v7.2.0 -CrossGen -Configuration Release to disable all experimental features (-ReleaseTag v7.2.0 means building a stable version).
  2. The module analysis cache file is already created for all 3 versions of PowerShell.

Benchmark Run

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 384.8 ms | 7.50 ms | 10.99 ms |  1.00 |    0.00 |     711 B |
| V72WipStartupTime | 313.6 ms | 5.26 ms |  4.66 ms |  0.81 |    0.03 |     218 B |
| WinPSStartupTime  | 211.2 ms | 3.48 ms |  2.91 ms |  0.55 |    0.02 |     711 B |

| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 391.4 ms | 7.81 ms | 10.16 ms |  1.00 |    0.00 |     711 B |
| V72WipStartupTime | 317.2 ms | 6.10 ms |  6.78 ms |  0.81 |    0.03 |     218 B |
| WinPSStartupTime  | 215.3 ms | 4.21 ms |  7.48 ms |  0.55 |    0.02 |      60 B |

| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 394.4 ms | 7.57 ms |  9.84 ms |  1.00 |    0.00 |      60 B |
| V72WipStartupTime | 316.2 ms | 6.25 ms |  6.14 ms |  0.81 |    0.02 |     218 B |
| WinPSStartupTime  | 217.8 ms | 3.87 ms |  5.29 ms |  0.55 |    0.02 |     711 B |

Benchmark program

using System.Diagnostics;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace benchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<Benchmark_pwsh_startup>();
        }
    }

    [DisassemblyDiagnoser(printSource: true)]
    public class Benchmark_pwsh_startup
    {
        private ProcessStartInfo v710, v72wip, winps;

        [GlobalSetup]
        public void Setup()
        {
            v710 = new ProcessStartInfo(@"C:\Program Files\PowerShell\7\pwsh.exe", "-noprofile -c echo 1");
            v72wip = new ProcessStartInfo(@"C:\arena\repos\PowerShell\src\powershell-win-core\bin\Release\net5.0\win7-x64\publish\pwsh.exe", "-noprofile -c echo 1");
            winps = new ProcessStartInfo(@"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe", "-noprofile -c echo 1");
        }

        [Benchmark(Baseline = true)]
        public void V710StartupTime()
        {
            Process.Start(v710).WaitForExit();
        }

        [Benchmark]
        public void V72WipStartupTime()
        {
            Process.Start(v72wip).WaitForExit();
        }

        [Benchmark]
        public void WinPSStartupTime()
        {
            Process.Start(winps).WaitForExit();
        }
    }
}

PR Checklist

@ghost ghost assigned TravisEz13 Dec 2, 2020
"xWebAdministration",
"xWindowsUpdate",
};
s_knownModules = new Lazy<HashSet<string>>(GetKnownModules);
@xtqqczze (Contributor) commented Dec 2, 2020

Can we avoid the Lazy allocation with LazyInitializer.EnsureInitialized, like:

private static HashSet<string> KnownModules => LazyInitializer.EnsureInitialized(ref s_knownModules, GetKnownModules);
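
For context, a minimal side-by-side of the two approaches (the surrounding class and names here are illustrative, not the actual patch):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

internal static class KnownModulesSketch
{
    // Current approach: allocates a Lazy<T> wrapper up front.
    private static readonly Lazy<HashSet<string>> s_knownModulesLazy =
        new Lazy<HashSet<string>>(GetKnownModules);

    // Suggested approach: no wrapper allocation; the field is populated on first access.
    private static HashSet<string> s_knownModules;

    private static HashSet<string> KnownModules =>
        LazyInitializer.EnsureInitialized(ref s_knownModules, GetKnownModules);

    private static HashSet<string> GetKnownModules() =>
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "xWebAdministration", "xWindowsUpdate" };
}
```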

@daxian-dbw daxian-dbw marked this pull request as ready for review December 2, 2020 16:49
@daxian-dbw daxian-dbw requested a review from anmenaga as a code owner December 2, 2020 16:49
@iSazonov (Collaborator) commented Dec 2, 2020

Eh, I've always said that MSFT developers shouldn't be given the latest hardware :-)
When I started #14268, my first desire was to follow the EarlyStartup.Init() pattern and move slow operations to background threads, but after testing on my old laptop, I did not see any performance gain.
From this I concluded that this approach is not well suited if we want to be fast on high-load systems and low-power equipment (IoT, Wasm and so on).
(We already create many CPU-bound threads, and adding new ones only makes sense for hardware with many cores - this is definitely not my laptop :-) )
So I did some PRs with direct performance improvements. @daxian-dbw Could you please consider them first? Even if we decide to move some of them to background threads, it will also make them lighter and faster.

I especially have doubts about GetPolicySetting(). It is slow entirely because of newtonsoft.json. It would be more promising for us to migrate to the built-in .NET JSON support and see the result. It will definitely be better. If we see problems, we can ask the .NET team, which continues to work actively on it.

(Perhaps we need to take a new look at EarlyStartup.Init().)

@daxian-dbw daxian-dbw marked this pull request as draft December 2, 2020 19:15
@daxian-dbw (Member Author)

@iSazonov Thanks for raising your concerns. I ran the benchmark on a 4-logical-core machine from 2017 (so not the latest hardware 😄), and I still observe over 10% improvement in startup.
Below are some of the results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i5-7300U CPU 2.60GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 429.1 ms | 5.67 ms | 5.30 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 372.6 ms | 5.68 ms | 5.03 ms |  0.87 |    0.02 |     218 B |
| WinPSStartupTime  | 254.2 ms | 3.12 ms | 2.77 ms |  0.59 |    0.01 |      40 B |

| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 427.8 ms | 6.32 ms | 5.60 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 377.5 ms | 5.18 ms | 4.33 ms |  0.88 |    0.02 |      40 B |
| WinPSStartupTime  | 252.5 ms | 2.90 ms | 2.57 ms |  0.59 |    0.01 |      40 B |

| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 431.0 ms | 6.08 ms | 5.69 ms |  1.00 |    0.00 |      40 B |
| V72WipStartupTime | 373.3 ms | 5.20 ms | 4.61 ms |  0.87 |    0.01 |     160 B |
| WinPSStartupTime  | 269.0 ms | 5.00 ms | 7.18 ms |  0.62 |    0.02 |     699 B |

@iSazonov (Collaborator) commented Dec 2, 2020

@iSazonov Thanks for raising your concerns. I ran the benchmark on a 4-logical-core machine from 2017 (so not the latest hardware 😄), and I still observe over 10% improvement in startup.

2017? :-) How would you classify my 2011 laptop? :-) Note that it is significantly newer than Windows 7, which still runs PowerShell 7. :-)

Seriously, my suggestion is to first make direct corrections, open issues in .NET if necessary (they already shared plans to speed up runtime initialization in the 6.0 milestone), and only before the next RC evaluate how much we need additional threads.

@daxian-dbw (Member Author) commented Dec 2, 2020

I agree that leveraging worker threads for parallel processing won't make much difference on a high-load system or on low-power equipment. However, that's not a reason not to use it, because many more users, who are on suitable hardware, will benefit from it.

Parallel processing is not the answer to everything (Amdahl's law), but it surely is a good tool as long as you use it cautiously -- be careful about the number of threads you spin up and about what workloads you process in parallel, so as to avoid contention/blocking as much as possible.
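
As a quick illustration of the Amdahl's law ceiling, with made-up numbers: if only a fraction p of the startup work can be overlapped, the overall speedup is bounded by the serial remainder no matter how many cores are available.

```csharp
using System;

class AmdahlSketch
{
    static void Main()
    {
        const double p = 0.3; // hypothetical parallelizable fraction of startup work

        foreach (int cores in new[] { 2, 4, 8 })
        {
            // Amdahl's law: speedup = 1 / ((1 - p) + p / N)
            double speedup = 1.0 / ((1.0 - p) + p / cores);
            Console.WriteLine($"{cores} cores -> {speedup:F2}x");
        }
    }
}
```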

Also, it doesn't conflict with the work you have been doing or the improvements that .NET 6 will bring us, does it?

@daxian-dbw (Member Author) commented Dec 2, 2020

2017? :-) How would you classify my 2011 laptop? :-) Note that it is significantly newer than Windows 7, which still runs PowerShell 7.

I don't think the OS version matters too much. It's more about how many logical cores the machine has, and, of course, how much load the machine is under at the moment.

But can you please run the benchmark on your 2011 machine and share the results? I would like to see if there is any penalty from context switching on a low-power machine. Please make sure you build my branch with Start-PSBuild -Clean -ReleaseTag v7.2.0 -CrossGen -Configuration Release, and then make sure its module cache is populated before running the benchmark. That can be done by 1) starting the generated pwsh.exe, 2) running a few commands, and 3) waiting for a minute or two, for safety.

@daxian-dbw daxian-dbw mentioned this pull request Dec 2, 2020
@iSazonov (Collaborator) commented Dec 3, 2020

But can you please run the benchmark on your 2011 machine and share the results?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1198 (1909/November2018Update/19H2)
Intel Core i5-2410M CPU 2.30GHz (Sandy Bridge), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT

| Method            | Mean     | Error    | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|---------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 686.8 ms |  7.93 ms | 6.19 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 682.2 ms | 10.90 ms | 9.66 ms |  0.99 |    0.02 |     218 B |
| WinPSStartupTime  | 365.0 ms |  4.85 ms | 4.30 ms |  0.53 |    0.01 |     218 B |

It is a stable result across several runs.
This seems to confirm that moving work to background threads is not that helpful for the startup scenario.
As I said in my PR, it's much better to ask the ApplicationInsights owners to improve this scenario rather than create additional threads.

@daxian-dbw (Member Author) commented Dec 3, 2020

Thanks for sharing the results!

This seems to confirm that moving work to background threads is not that helpful for the startup scenario.

Nah, this is actually not bad at all, as it shows no penalty (context switching, etc.) on your 2011 laptop.
Again, leveraging parallel processing doesn't conflict with the work that improves individual code paths. If handled correctly, they amplify the final result.

@iSazonov (Collaborator) commented Dec 4, 2020

it shows no penalty (context switching, etc.)

I'm not talking about a penalty. From my experience, Windows works well on 2 cores (including applications), and I guess we will see the same result for our test on slower hardware.
I'm talking about (1) the roadmap and (2) how much benefit we will get from parallelization in the startup scenario.
It is well known that replacing a bad algorithm with a better one can have the greatest effect. Also, direct fixes are generally more effective than workarounds. For example, in our case, we could get a 20% benefit from direct fixes, and only 1% from the subsequent parallelization, even on fast hardware, because the code is already fast.
So my suggestion for a roadmap is to work on direct fixes here and in other repositories before RC and then investigate parallelization.
This will simplify the final performance checks before 7.2 release.

@daxian-dbw (Member Author)

Don't think of parallelization as a workaround; think of it as a tool. A tool is neither good nor bad; there is only a right way or a wrong way to use it. As I said above, the right way is "be careful about the number of threads you spin up and about what workloads you process in parallel, so as to avoid contention/blocking as much as possible."

An example regarding the workload aspect:

  1. Today, DefaultInitialSessionState = CreateDefault2() is called in ManagedEntrance.Start, but the created InitialSessionState instance is not used until DoCreateRunspace. I did some measurements on my "latest" machine: CreateDefault2() takes ~36 ms, and then it takes ~40 ms for the code flow to reach the first use of DefaultInitialSessionState in DoCreateRunspace.

  2. Meanwhile, the CreateDefault2() call is pretty self-contained; no synchronization is needed when running it on a worker thread.

Together, these make CreateDefault2() a good candidate workload to run as a task, as of today.
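
A self-contained toy that mimics this overlap (the ~36 ms / ~40 ms figures come from the measurement above; the workloads are simulated with Thread.Sleep):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class OverlapSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();

        // Stand-in for InitialSessionState.CreateDefault2(): ~36 ms on a worker thread.
        Task<int> issTask = Task.Run(() => { Thread.Sleep(36); return 42; });

        // Stand-in for the ~40 ms of other startup work the main thread does before
        // DoCreateRunspace first touches DefaultInitialSessionState.
        Thread.Sleep(40);

        // By the time the result is needed, the task has typically already finished,
        // so this wait is close to free.
        int iss = issTask.GetAwaiter().GetResult();

        Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds} ms (vs ~76 ms if run serially), iss={iss}");
    }
}
```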

For example, in our case, we could get a 20% benefit from direct fixes, and only 1% from the subsequent parallelization, even on fast hardware, because the code is already fast.

This sounds like a pretty exaggerated hypothetical example :) But if we cannot find the right workload for parallelization, then that just means we should not use that tool.

I want to emphasize again: parallelization doesn't conflict with the work that improves individual code paths. If handled correctly, they amplify the final result.

So my suggestion for a roadmap is to work on direct fixes here and in other repositories before RC and then investigate parallelization.

Yeah, I agree not to rush this PR. As I mentioned in our discussion at #14281 (comment), more work and testing are needed to "fine-tune the decisions like how many tasks to use, and what work load shall be done in tasks." And decisions will change as more perf work on our side and the .NET side gets in.

But I believe we cannot wait till the RC timeframe, as changes in this PR (or similar PRs) need sufficient bake time.

@iSazonov (Collaborator) commented Dec 4, 2020

But I believe we cannot wait till the RC timeframe, as changes in this PR (or similar PRs) need sufficient bake time.

I mean the last preview before RC is a good time to check, investigate, and fix performance issues.

@iSazonov (Collaborator) commented Dec 4, 2020

Today my test shows a regression of up to 20% because of ... a slow DNS request from ApplicationInsights! I saw this previously. So any network issue delays PowerShell startup due to ApplicationInsights - a good argument for moving it to a different thread.

@daxian-dbw (Member Author)

@iSazonov Are you sure it's due to ApplicationInsights? Did you try disabling the telemetry to prove that's the cause? As we discussed in #14281 (comment), ApplicationInsights doesn't send telemetry packets synchronously.

What you saw looks to me more like a variant of #10983 -- a DNS issue may cause the certificate security check to block.

@iSazonov (Collaborator) commented Dec 5, 2020

@iSazonov Are you sure it's due to ApplicationInsights? Did you try disabling the telemetry to prove that's the cause? As we discussed in #14281 (comment), ApplicationInsights doesn't send telemetry packets synchronously.

I saw DnsResolver in the ApplicationInsights constructor.
It would be easier to investigate if the code were open source. :-)

@daxian-dbw (Member Author) commented Dec 6, 2020

I'm pretty sure ApplicationInsights sends telemetry asynchronously. We are using the default configuration with InMemoryChannel, whose behavior is (see "more telemetry channels" for details):

InMemory channel Microsoft.ApplicationInsights.Channel.InMemoryChannel is a lightweight lossy telemetry channel that is used by default. It will batch data and initiate sending every SendingInterval (default is 30 seconds) or when number of items exceeded MaxTelemetryBufferCapacity (default is 500). It also will not retry if it failed to send telemetry.

When using in-memory channel, the Flush call is synchronous. It forces data to be sent immediately.

But, interestingly, this also indicates that today, for the pwsh -c xxx scenario, PowerShell never actually sends out telemetry packets if the execution takes less than 30 seconds. Need to discuss with @JamesWTruher whether that's a bug that should be fixed by calling .Flush.
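
If we do decide to fix it, the fix would roughly amount to flushing the client before exit; a hypothetical sketch (TelemetryClient.Flush is synchronous for the in-memory channel):

```csharp
using Microsoft.ApplicationInsights;

internal static class StartupTelemetrySketch
{
    private static readonly TelemetryClient s_client = new TelemetryClient();

    internal static void SendStartupTelemetry()
    {
        s_client.TrackEvent("ConsoleHostStartup");
    }

    internal static void FlushOnExit()
    {
        // With the default InMemoryChannel, data is only sent every 30 seconds or after
        // 500 buffered items, so a short-lived 'pwsh -c ...' process would otherwise
        // exit before anything is sent.
        s_client.Flush();
    }
}
```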

@iSazonov (Collaborator) commented Dec 7, 2020

Sources https://github.com/microsoft/ApplicationInsights-dotnet :-)

@ghost ghost added the Waiting on Author label Dec 10, 2020
@SteveL-MSFT SteveL-MSFT added this to the 7.2-Consider milestone Dec 16, 2020
@ghost ghost removed this from the 7.2-Consider milestone Dec 16, 2020
@ghost commented Dec 16, 2020

Open PRs should not be assigned to a milestone, so that they are not assigned to the wrong milestone after they are merged. For backport consideration, use a backport label.

@ghost commented Jan 1, 2021

This pull request has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 15 days. It will be closed if no further activity occurs within 10 days of this comment.

@ghost ghost removed the Stale label Jan 1, 2021
@ghost ghost added the Stale label Jan 16, 2021
@ghost commented Jan 16, 2021

This pull request has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 15 days. It will be closed if no further activity occurs within 10 days of this comment.

@daxian-dbw daxian-dbw removed the Stale and Waiting on Author labels Jan 20, 2021
@daxian-dbw daxian-dbw closed this Jan 20, 2021
@daxian-dbw daxian-dbw deleted the startup branch January 20, 2021 18:25
@daxian-dbw daxian-dbw restored the startup branch January 20, 2021 18:25