
Conversation

@daxian-dbw (Member) commented Dec 2, 2020

PR Summary

Compared with v7.1.0, the changes in this PR improve the startup of a stable-version PowerShell by about 19% on an 8-logical-core machine and by about 10% on a 4-logical-core machine.
A stable version of PowerShell means experimental features are disabled in the build by default, just like a GA release.

Major changes:

  1. Push the call to InitialSessionState.CreateDefault2() to a worker thread.
  2. Push the call to GetConfigurationNameFromGroupPolicy() to a worker thread.
    The method Utils.GetPolicySetting<T>(...) is pretty expensive at startup time, including loading system.collections.concurrent.dll, system.io.filesystem.dll, newtonsoft.json.dll, the overhead of using ConcurrentDictionary, and the overhead of reading settings from the configuration file.
  3. Push the call to SendPSCoreStartupTelemetry to a worker thread.
    This is also an expensive call at startup time, including loading microsoft.applicationinsights.dll, initializing the telemetry client, and actually writing out the telemetry.

By pushing them to worker threads, the assembly loading and type initialization all take place on worker threads, allowing the main thread to keep moving forward until it has to block for one of the tasks to finish.
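
The shape of the change is roughly as follows (an illustrative sketch only, not the actual ConsoleHost code; the stand-in helpers below mirror the method names listed above):

```csharp
using System.Threading.Tasks;
using System.Management.Automation.Runspaces;

// Hypothetical sketch: queue the expensive initializers to worker threads right
// after entry, so assembly loading and type initialization happen off the main thread.
internal static class StartupOffloadSketch
{
    internal static Task<InitialSessionState> DefaultIssTask;
    internal static Task<string> ConfigurationNameTask;
    internal static Task TelemetryTask;

    internal static void Begin()
    {
        DefaultIssTask = Task.Run(() => InitialSessionState.CreateDefault2());
        ConfigurationNameTask = Task.Run(() => GetConfigurationNameFromGroupPolicy());
        TelemetryTask = Task.Run(() => SendPSCoreStartupTelemetry());
    }

    // Stand-ins for the internal helpers named in the PR summary.
    private static string GetConfigurationNameFromGroupPolicy() => null;
    private static void SendPSCoreStartupTelemetry() { }
}
```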

Implementation-wise,

  1. ConsoleHost.DefaultInitialSessionState is changed so that its getter either returns a previously set InitialSessionState value, or waits for the task result and returns it (see the sketch after this list).
  2. Similarly, CommandLineParameterParser.ConfigurationName is changed to have its getter wait for the 'GetConfigurationNameFromGroupPolicy' task to finish.
  3. Telemetry doesn't block anything, but we need to make sure the startup telemetry gets sent before the process exits. (We ignore the cases where the process is forcefully terminated, in which case we cannot guarantee the telemetry is sent anyway.)
  4. Remove CreateRunspace(object runspaceCreationArgs) and RunspaceCreationEventArgs. They add an intermediate layer of method calls that is not needed at all.
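
A rough sketch of the getter pattern from items 1 and 2 (field names here are illustrative; the real property has more state handling):

```csharp
using System.Threading.Tasks;
using System.Management.Automation.Runspaces;

internal sealed class ConsoleHostSketch
{
    private InitialSessionState _defaultIss;
    private Task<InitialSessionState> _defaultIssTask;

    // Hypothetical sketch of the DefaultInitialSessionState change: return a previously
    // set value if there is one, otherwise block on the worker-thread task the first
    // time the value is actually needed.
    internal InitialSessionState DefaultInitialSessionState
    {
        get
        {
            if (_defaultIss == null && _defaultIssTask != null)
            {
                _defaultIss = _defaultIssTask.GetAwaiter().GetResult();
                _defaultIssTask = null;
            }

            return _defaultIss;
        }

        set => _defaultIss = value;
    }
}
```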

Startup time comparison

The following results are from 3 consecutive runs of the benchmark program, which compares v7.1.0, v7.2-wip (this PR), and Windows PowerShell 5.1.

Notes:

  1. Since we are comparing with the stable version v7.1.0, v7.2-wip is built using Start-PSBuild -Clean -ReleaseTag v7.2.0 -CrossGen -Configuration Release to disable all experimental features (-ReleaseTag v7.2.0 means building a stable version).
  2. The module analysis cache file is already created for all 3 versions of PowerShell.

Benchmark Run

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 384.8 ms | 7.50 ms | 10.99 ms |  1.00 |    0.00 |     711 B |
| V72WipStartupTime | 313.6 ms | 5.26 ms |  4.66 ms |  0.81 |    0.03 |     218 B |
| WinPSStartupTime  | 211.2 ms | 3.48 ms |  2.91 ms |  0.55 |    0.02 |     711 B |

| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 391.4 ms | 7.81 ms | 10.16 ms |  1.00 |    0.00 |     711 B |
| V72WipStartupTime | 317.2 ms | 6.10 ms |  6.78 ms |  0.81 |    0.03 |     218 B |
| WinPSStartupTime  | 215.3 ms | 4.21 ms |  7.48 ms |  0.55 |    0.02 |      60 B |

| Method            | Mean     | Error   | StdDev   | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|---------:|------:|--------:|----------:|
| V710StartupTime   | 394.4 ms | 7.57 ms |  9.84 ms |  1.00 |    0.00 |      60 B |
| V72WipStartupTime | 316.2 ms | 6.25 ms |  6.14 ms |  0.81 |    0.02 |     218 B |
| WinPSStartupTime  | 217.8 ms | 3.87 ms |  5.29 ms |  0.55 |    0.02 |     711 B |

Benchmark program

using System.Diagnostics;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace benchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<Benchmark_pwsh_startup>();
        }
    }

    [DisassemblyDiagnoser(printSource: true)]
    public class Benchmark_pwsh_startup
    {
        private ProcessStartInfo v710, v72wip, winps;

        [GlobalSetup]
        public void Setup()
        {
            v710 = new ProcessStartInfo(@"C:\Program Files\PowerShell\7\pwsh.exe", "-noprofile -c echo 1");
            v72wip = new ProcessStartInfo(@"C:\arena\repos\PowerShell\src\powershell-win-core\bin\Release\net5.0\win7-x64\publish\pwsh.exe", "-noprofile -c echo 1");
            winps = new ProcessStartInfo(@"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe", "-noprofile -c echo 1");
        }

        [Benchmark(Baseline = true)]
        public void V710StartupTime()
        {
            Process.Start(v710).WaitForExit();
        }

        [Benchmark]
        public void V72WipStartupTime()
        {
            Process.Start(v72wip).WaitForExit();
        }

        [Benchmark]
        public void WinPSStartupTime()
        {
            Process.Start(winps).WaitForExit();
        }
    }
}

PR Checklist

@ghost ghost assigned TravisEz13 Dec 2, 2020
"xWebAdministration",
"xWindowsUpdate",
};
s_knownModules = new Lazy<HashSet<string>>(GetKnownModules);
@xtqqczze (Contributor) commented Dec 2, 2020

Can we avoid the Lazy allocation with LazyInitializer.EnsureInitialized, like:

private static HashSet<string> KnownModules => LazyInitializer.EnsureInitialized(ref s_knownModules, GetKnownModules);
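
For context, a minimal side-by-side of the two approaches (the surrounding class and names here are illustrative, not the actual patch):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

internal static class KnownModulesSketch
{
    // Current approach: allocates a Lazy<T> wrapper up front.
    private static readonly Lazy<HashSet<string>> s_knownModulesLazy =
        new Lazy<HashSet<string>>(GetKnownModules);

    // Suggested approach: no wrapper allocation; the field is populated on first access.
    private static HashSet<string> s_knownModules;

    private static HashSet<string> KnownModules =>
        LazyInitializer.EnsureInitialized(ref s_knownModules, GetKnownModules);

    private static HashSet<string> GetKnownModules() =>
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "xWebAdministration", "xWindowsUpdate" };
}
```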

@daxian-dbw daxian-dbw marked this pull request as ready for review December 2, 2020 16:49
@daxian-dbw daxian-dbw requested a review from anmenaga as a code owner December 2, 2020 16:49
@iSazonov (Collaborator) commented Dec 2, 2020

Eh, I've always said that MSFT developers shouldn't be given the latest hardware :-)
When I started #14268, my first desire was to follow the EarlyStartup.Init() pattern and move slow operations to background threads, but after testing on my old laptop, I did not see any performance gain.
From this I concluded that this approach is not well suited if we want to be fast on high-load systems and low-power equipment (IoT, Wasm and so on).
(We already create many CPU-bound threads, and adding new ones only makes sense for hardware with many cores - this is definitely not my laptop :-) )
So I did some PRs with direct performance improvements. @daxian-dbw Could you please consider them first? Even if we decide to move some of them to background threads, it will also make them lighter and faster.

I especially have doubts about GetPolicySetting(). It is slow entirely because of newtonsoft.json. It would be more promising for us to migrate to the built-in .NET JSON support and see the result. It will definitely be better. If we see problems, we can ask the .NET team, which continues to work actively on it.

(Perhaps we need to take a new look at EarlyStartup.Init().)

@daxian-dbw daxian-dbw marked this pull request as draft December 2, 2020 19:15
@daxian-dbw (Member Author)

@iSazonov Thanks for raising your concerns. I ran the benchmark on a 4-logical-core machine from 2017 (so not the latest hardware 😄), and I still observe over 10% improvement in startup.
Below are some of the results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i5-7300U CPU 2.60GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 429.1 ms | 5.67 ms | 5.30 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 372.6 ms | 5.68 ms | 5.03 ms |  0.87 |    0.02 |     218 B |
| WinPSStartupTime  | 254.2 ms | 3.12 ms | 2.77 ms |  0.59 |    0.01 |      40 B |

| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 427.8 ms | 6.32 ms | 5.60 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 377.5 ms | 5.18 ms | 4.33 ms |  0.88 |    0.02 |      40 B |
| WinPSStartupTime  | 252.5 ms | 2.90 ms | 2.57 ms |  0.59 |    0.01 |      40 B |

| Method            | Mean     | Error   | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|--------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 431.0 ms | 6.08 ms | 5.69 ms |  1.00 |    0.00 |      40 B |
| V72WipStartupTime | 373.3 ms | 5.20 ms | 4.61 ms |  0.87 |    0.01 |     160 B |
| WinPSStartupTime  | 269.0 ms | 5.00 ms | 7.18 ms |  0.62 |    0.02 |     699 B |

@iSazonov (Collaborator) commented Dec 2, 2020

@iSazonov Thanks for raising your concerns. I ran the benchmark on a 4-logical-core machine from 2017 (so not the latest hardware 😄), and I still observe over 10% improvement in startup.

2017? :-) How would you classify my 2011 laptop? :-) Note that it is significantly newer than Windows 7, which still runs PowerShell 7. :-)

Seriously, my suggestion is to first make direct corrections, open issues in .NET if necessary (they already shared plans to speed up runtime initialization in the 6.0 milestone), and only before the next RC evaluate how much we need additional threads.

@daxian-dbw (Member Author) commented Dec 2, 2020

I agree that leveraging worker threads for parallel processing won't make much difference on a high-load system or on low-power equipment. However, that's not a reason not to use it, because many more users, who are on suitable hardware, will benefit from it.

Parallel processing is not the answer to everything (Amdahl's law), but it surely is a good tool as long as you use it cautiously -- be careful about the number of threads you spin up and about what workloads you process in parallel, so as to avoid contention/blocking as much as possible.
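
As a quick illustration of the Amdahl's law ceiling, with made-up numbers: if only a fraction p of the startup work can be overlapped, the overall speedup is bounded by the serial remainder no matter how many cores are available.

```csharp
using System;

class AmdahlSketch
{
    static void Main()
    {
        const double p = 0.3; // hypothetical parallelizable fraction of startup work

        foreach (int cores in new[] { 2, 4, 8 })
        {
            // Amdahl's law: speedup = 1 / ((1 - p) + p / N)
            double speedup = 1.0 / ((1.0 - p) + p / cores);
            Console.WriteLine($"{cores} cores -> {speedup:F2}x");
        }
    }
}
```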

Also, it doesn't conflict with the work you have been doing or the improvements that .NET 6 will bring us, does it?

@daxian-dbw (Member Author) commented Dec 2, 2020

2017? :-) How would you classify my 2011 laptop? :-) Note that it is significantly newer than Windows 7, which still runs PowerShell 7.

I don't think the OS version matters too much. It's more about how many logical cores the machine has, and, of course, how much load the machine is under at the moment.

But can you please run the benchmark on your 2011 machine and share the results? I would like to see if there is any penalty from context switching on a low-power machine. Please make sure you build my branch with Start-PSBuild -Clean -ReleaseTag v7.2.0 -CrossGen -Configuration Release, and then make sure its module cache is populated before running the benchmark. That can be done by 1) starting the generated pwsh.exe, 2) running a few commands, and 3) waiting for a minute or two, for safety.

@daxian-dbw daxian-dbw mentioned this pull request Dec 2, 2020
@iSazonov (Collaborator) commented Dec 3, 2020

But can you please run the benchmark on your 2011 machine and share the results?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1198 (1909/November2018Update/19H2)
Intel Core i5-2410M CPU 2.30GHz (Sandy Bridge), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT

| Method            | Mean     | Error    | StdDev  | Ratio | RatioSD | Code Size |
|-------------------|---------:|---------:|--------:|------:|--------:|----------:|
| V710StartupTime   | 686.8 ms |  7.93 ms | 6.19 ms |  1.00 |    0.00 |     218 B |
| V72WipStartupTime | 682.2 ms | 10.90 ms | 9.66 ms |  0.99 |    0.02 |     218 B |
| WinPSStartupTime  | 365.0 ms |  4.85 ms | 4.30 ms |  0.53 |    0.01 |     218 B |

It is a stable result across several runs.
This seems to confirm that moving work to background threads is not that helpful for the startup scenario.
As I said in my PR, it's much better to ask the ApplicationInsights owners to improve this scenario rather than create additional threads.

@daxian-dbw (Member Author) commented Dec 3, 2020

Thanks for sharing the results!

This seems to confirm that moving work to background threads is not that helpful for the startup scenario.

Nah, this is actually not bad at all, as it shows no penalty (context switching, etc.) on your 2011 laptop.
Again, leveraging parallel processing doesn't conflict with the work that improves individual code paths. If handled correctly, they amplify the final result.

@iSazonov (Collaborator) commented Dec 4, 2020

it shows no penalty (context switching, etc.)

I'm not talking about a penalty. From my experience, Windows works well on 2 cores (including applications), and I guess we will see the same result for our test on slower hardware.
I'm talking about (1) the roadmap and (2) how much benefit we will get from parallelization in the startup scenario.
It is well known that replacing a bad algorithm with a better one can have the greatest effect. Also, direct fixes are generally more effective than workarounds. For example, in our case, we could get a 20% benefit from direct fixes, and only 1% from the subsequent parallelization, even on fast hardware, because the code is already fast.
So my suggestion for a roadmap is to work on direct fixes here and in other repositories before RC and then investigate parallelization.
This will simplify the final performance checks before 7.2 release.

@daxian-dbw (Member Author)

Don't think of parallelization as a workaround; think of it as a tool. A tool is neither good nor bad; there is only a right way or a wrong way to use it. As I said above, the right way is "be careful about the number of threads you spin up and about what workloads you process in parallel, so as to avoid contention/blocking as much as possible."

An example regarding the workload aspect:

  1. Today, DefaultInitialSessionState = CreateDefault2() is called in ManagedEntrance.Start, but the created InitialSessionState instance is not used until DoCreateRunspace. I did some measurements on my "latest" machine: CreateDefault2() takes ~36 ms, and then it takes ~40 ms for the code flow to reach the first use of DefaultInitialSessionState in DoCreateRunspace.

  2. Meanwhile, the CreateDefault2() call is pretty self-contained; no synchronization is needed when running it on a worker thread.

Together, these make CreateDefault2() a good candidate workload to run as a task, as of today.
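
A self-contained toy that mimics this overlap (the ~36 ms / ~40 ms figures come from the measurement above; the workloads are simulated with Thread.Sleep):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class OverlapSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();

        // Stand-in for InitialSessionState.CreateDefault2(): ~36 ms on a worker thread.
        Task<int> issTask = Task.Run(() => { Thread.Sleep(36); return 42; });

        // Stand-in for the ~40 ms of other startup work the main thread does before
        // DoCreateRunspace first touches DefaultInitialSessionState.
        Thread.Sleep(40);

        // By the time the result is needed, the task has typically already finished,
        // so this wait is close to free.
        int iss = issTask.GetAwaiter().GetResult();

        Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds} ms (vs ~76 ms if run serially), iss={iss}");
    }
}
```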

For example, in our case, we could get a 20% benefit from direct fixes, and only 1% from the subsequent parallelization, even on fast hardware, because the code is already fast.

This sounds like a pretty exaggerated hypothetical example :) But if we cannot find the right workload for parallelization, then that just means we should not use that tool.

I want to emphasize again: parallelization doesn't conflict with the work that improves individual code paths. If handled correctly, they amplify the final result.

So my suggestion for a roadmap is to work on direct fixes here and in other repositories before RC and then investigate parallelization.

Yeah, I agree not to rush this PR. As I mentioned in our discussion at #14281 (comment), more work and testing are needed to "fine-tune the decisions like how many tasks to use, and what work load shall be done in tasks." And decisions will change as more perf work on our side and the .NET side gets in.

But I believe we cannot wait till the RC timeframe, as changes in this PR (or similar PRs) need sufficient bake time.

@iSazonov (Collaborator) commented Dec 4, 2020

But I believe we cannot wait till the RC timeframe, as changes in this PR (or similar PRs) need sufficient bake time.

I mean the last preview before RC is a good time to check, investigate, and fix performance issues.

@iSazonov (Collaborator) commented Dec 4, 2020

Today my test shows a regression of up to 20% because of ... a slow DNS request from ApplicationInsights! I saw this previously. So any network issue delays PowerShell startup due to ApplicationInsights - a good argument for moving it to a different thread.

@daxian-dbw (Member Author)

@iSazonov Are you sure it's due to ApplicationInsights? Did you try disabling the telemetry to prove that's the cause? As we discussed in #14281 (comment), ApplicationInsights doesn't send telemetry packets synchronously.

What you saw looks to me more like a variant of #10983 -- a DNS issue may cause the certificate security check to block.

@iSazonov (Collaborator) commented Dec 5, 2020

@iSazonov Are you sure it's due to ApplicationInsights? Did you try disabling the telemetry to prove that's the cause? As we discussed in #14281 (comment), ApplicationInsights doesn't send telemetry packets synchronously.

I saw DnsResolver in the ApplicationInsights constructor.
It would be easier to investigate if the code were open source. :-)

@daxian-dbw (Member Author) commented Dec 6, 2020

I'm pretty sure ApplicationInsights sends telemetry asynchronously. We are using the default configuration with InMemoryChannel, whose behavior is (see "more telemetry channels" for details):

InMemory channel Microsoft.ApplicationInsights.Channel.InMemoryChannel is a lightweight lossy telemetry channel that is used by default. It will batch data and initiate sending every SendingInterval (default is 30 seconds) or when number of items exceeded MaxTelemetryBufferCapacity (default is 500). It also will not retry if it failed to send telemetry.

When using in-memory channel, the Flush call is synchronous. It forces data to be sent immediately.

But, interestingly, this also indicates that today, for the pwsh -c xxx scenario, PowerShell never actually sends out telemetry packets if the execution takes less than 30 seconds. Need to discuss with @JamesWTruher whether that's a bug that should be fixed by calling .Flush.
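
If we do decide to fix it, the fix would roughly amount to flushing the client before exit; a hypothetical sketch (TelemetryClient.Flush is synchronous for the in-memory channel):

```csharp
using Microsoft.ApplicationInsights;

internal static class StartupTelemetrySketch
{
    private static readonly TelemetryClient s_client = new TelemetryClient();

    internal static void SendStartupTelemetry()
    {
        s_client.TrackEvent("ConsoleHostStartup");
    }

    internal static void FlushOnExit()
    {
        // With the default InMemoryChannel, data is only sent every 30 seconds or after
        // 500 buffered items, so a short-lived 'pwsh -c ...' process would otherwise
        // exit before anything is sent.
        s_client.Flush();
    }
}
```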

@iSazonov (Collaborator) commented Dec 7, 2020

Sources https://github.com/microsoft/ApplicationInsights-dotnet :-)

@ghost ghost added the Waiting on Author label Dec 10, 2020
@SteveL-MSFT SteveL-MSFT added this to the 7.2-Consider milestone Dec 16, 2020
@ghost ghost removed this from the 7.2-Consider milestone Dec 16, 2020
@ghost commented Dec 16, 2020

Open PRs should not be assigned to a milestone, so that they are not assigned to the wrong milestone after they are merged. For backport consideration, use a backport label.

@ghost commented Jan 1, 2021

This pull request has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 15 days. It will be closed if no further activity occurs within 10 days of this comment.

@ghost ghost removed the Stale label Jan 1, 2021
@ghost ghost added the Stale label Jan 16, 2021
@ghost commented Jan 16, 2021

This pull request has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 15 days. It will be closed if no further activity occurs within 10 days of this comment.

@daxian-dbw daxian-dbw removed the Stale and Waiting on Author labels Jan 20, 2021
@daxian-dbw daxian-dbw closed this Jan 20, 2021
@daxian-dbw daxian-dbw deleted the startup branch January 20, 2021 18:25
@daxian-dbw daxian-dbw restored the startup branch January 20, 2021 18:25