@iSazonov
Collaborator

@iSazonov iSazonov commented Sep 7, 2018

PR Summary

Related #2230.

PR Checklist

@lzybkr
Contributor

lzybkr commented Sep 7, 2018

I'm not sure this is a good idea - why would we want to use 2 different interpreters?

@iSazonov
Collaborator Author

iSazonov commented Sep 7, 2018

I think it does not make sense to port (update) the code from CoreFX here; it's better to migrate to the CoreFX interpreter. I see that it doesn't support a Compile() overload with a threshold, so the first step is to use Compile(preferInterpretation: true).
The second step is an open question. We could use tiered compilation (instead of a Compile() overload with a threshold) and remove our interpreter.
https://blogs.msdn.microsoft.com/dotnet/2018/08/02/tiered-compilation-preview-in-net-core-2-1/
There already is TieredCompilation_Tier1CallCountThreshold = 30
https://github.com/dotnet/coreclr/blob/f6174b93d100d46f4641f040b6de5fa254c1ee71/Documentation/project-docs/clr-configuration-knobs.md

From https://github.com/dotnet/coreclr/issues/4331 I see that we can get benefits for crossgened code too.
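
For readers unfamiliar with the overload, here is a minimal sketch of what `Compile(preferInterpretation: true)` looks like when driven from PowerShell. It assumes a runtime where `LambdaExpression.Compile(bool)` is available (for example PowerShell Core on .NET Core). The lambda `x => x + 1` and all variable names are illustrative assumptions and are not taken from `CompileTree()`.

```powershell
# Illustrative only: build the expression tree for x => x + 1.
$E      = [System.Linq.Expressions.Expression]
$x      = $E::Parameter([int], 'x')
$body   = $E::Add($x, $E::Constant(1))
$lambda = $E::Lambda([Func[int, int]], $body,
                     [System.Linq.Expressions.ParameterExpression[]]@($x))

# preferInterpretation = $true asks for the expression interpreter where the
# runtime provides one, trading steady-state speed for cheaper creation.
$interpreted = $lambda.Compile($true)

# preferInterpretation = $false emits IL that the JIT turns into native code.
$compiled = $lambda.Compile($false)

# Both delegates compute the same value; only the execution strategy differs.
$interpreted.Invoke(41)
$compiled.Invoke(41)
```

Timing the two delegates in a tight loop is one way to check the interpreter-versus-JIT difference discussed in this thread on a given runtime.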

@iSazonov
Collaborator Author

iSazonov commented Sep 7, 2018

I see #7729 - crossgen doesn't work with framework-dependent deployment and tiered compilation will come in handy.
/cc @SteveL-MSFT

@iSazonov
Collaborator Author

iSazonov commented Sep 7, 2018

I checked with PerfView that TC works. The full set of tests seems to take about the same time on CI.

@lzybkr
Contributor

lzybkr commented Sep 7, 2018

Why do you think it's better to migrate? Do you have data that shows it's faster?

It does not jit compile, so loops will be ~50X slower.

@daxian-dbw
Member

daxian-dbw commented Sep 8, 2018

@iSazonov The interpreter in PowerShell was updated to act like tiered compilation. For a script block, or a loop in it, that contains fewer than 300 statements, the script block and the loop will initially be evaluated in the interpreted way (fast startup), and after running a certain number of times, they will be compiled and executed as jitted native code (better steady-state performance).

The JIT tiered compilation cannot replace this optimization. The tiered compilation will optimize the Run methods from certain instructions, but no matter how JIT is able to optimize those individual methods, the script is still being evaluated in an interpreted way -- fetch an instruction, data gets pushed to a stack in the interpreter, run some C# code, pop the data, save to local variable list, fetch the next instruction, etc. However, after a compiled delegate gets created on demand, the script will be running directly in the jitted native code, and it's possible for the compiled delegate to further benefit from the tiered compilation and get even better performance.

You can see it as a 3-tiered compilation.
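
The interpret-first, compile-when-hot idea described above can be sketched conceptually as follows. This is not the engine's code: the function names, the object shape, and the threshold of 3 are hypothetical, and two script blocks merely stand in for the interpreted path and the compiled delegate.

```powershell
# Conceptual sketch only; names and threshold are hypothetical, not engine code.
function New-TieredBlock {
    param(
        [scriptblock] $Interpreted,   # stand-in for the slow, interpreted path
        [scriptblock] $Compiled,      # stand-in for the compiled delegate
        [int] $Threshold = 3          # assumed number of cold runs before switching
    )
    [pscustomobject]@{
        Interpreted = $Interpreted
        Compiled    = $Compiled
        Threshold   = $Threshold
        RunCount    = 0
    }
}

function Invoke-TieredBlock {
    param($Block, $Argument)
    $Block.RunCount++
    if ($Block.RunCount -le $Block.Threshold) {
        # Cold: keep interpreting so the first runs start quickly.
        $tier  = 'Interpreted'
        $value = & $Block.Interpreted $Argument
    }
    else {
        # Hot: switch to the compiled path for steady-state performance.
        $tier  = 'Compiled'
        $value = & $Block.Compiled $Argument
    }
    [pscustomobject]@{ Tier = $tier; Value = $value }
}

$block = New-TieredBlock -Interpreted { param($n) $n + 1 } -Compiled { param($n) $n + 1 }
$tiers = 1..5 | ForEach-Object { (Invoke-TieredBlock -Block $block -Argument $_).Tier }
$tiers
```

With the assumed threshold, the first three calls report the interpreted tier and the remaining calls report the compiled tier; the JIT's own tiering would then apply on top of the compiled path.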

@iSazonov
Collaborator Author

iSazonov commented Sep 10, 2018

I added a commit with test hook to switch interpreter/compiler.
Test script:

[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $true)

$step1 = measure-command {
for ($i = 0; $i -lt 30; $i++) {
    $a+=1
}
}


[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $false)
$step2 = measure-command {
for ($i = 0; $i -lt 30; $i++) {
    $a+=1
}
}

$step1.TotalMilliseconds
$step2.TotalMilliseconds

Results (obtained by copy-pasting the script into the console; times in ms):
(Updated because of a bug in the test script.)

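| Iterations | TC Crossgened interpreter | TC Crossgened Compiler | Crossgened interpreter | Crossgened Compiler | 6.1 RC |
|---|---|---|---|---|---|
| 30 | 4 | 9 | 4 | 10 | 4 |
| 300 | 6 | 10 | 7 | 11 | 6 |
| 1000 | 17 | 15 | 14 | 13 | 9 |
| 3000 | 34 | 19 | 34 | 19 | 17 |
| 30000 | 266 | 88 | 296 | 94 | 92 |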

The results (although I cannot consider them reliable) show that with the current change we get better results than with 6.1 RC.

I think it's worth it to study further.

/cc @powercode maybe you will be interested.

@lzybkr
Contributor

lzybkr commented Sep 10, 2018

Are you certain your experiment is valid? It's possible you only execute one code path because of caching.

@iSazonov
Collaborator Author

@lzybkr What cache do you mean? I added a test hook to explicitly switch from interpreter (CoreFX) to compiler.

@lzybkr
Contributor

lzybkr commented Sep 10, 2018

PowerShell caches script block definitions to avoid recompiling, so I was just asking that you confirm your experiment hits the code paths you expect.

I measured a 50X slowdown when switching to the CoreFx interpreter and this did not surprise me because their interpreter no longer supports JIT.

Your results do surprise me - if JIT is happening at some point I'm happy, but I'd like pointers to where that happens or at least a solid explanation of what magic is making the new interpreter faster.
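
One way to reduce the chance that caching skews such a benchmark is to give every run a distinct script text. The sketch below assumes the cache is keyed on script text, which this thread does not confirm; `New-UncachedScriptBlock` is a hypothetical helper, not an existing cmdlet.

```powershell
# Hypothetical helper: returns a script block whose text is unique per call.
# Assumption: script block definitions are cached by their text, so a unique
# trailing comment should force a fresh definition on each call.
function New-UncachedScriptBlock {
    param([Parameter(Mandatory)] [string] $ScriptText)
    $tag = [guid]::NewGuid().ToString('N')
    [scriptblock]::Create("$ScriptText`n# run-$tag")
}

$loop = 'for ($i = 0; $i -lt 30; $i++) { $a += 1 }'

# Create the block outside Measure-Command so only execution is timed.
$sb1 = New-UncachedScriptBlock -ScriptText $loop
$sb2 = New-UncachedScriptBlock -ScriptText $loop

(Measure-Command { & $sb1 }).TotalMilliseconds
(Measure-Command { & $sb2 }).TotalMilliseconds
```

Confirming in a debugger which code path actually runs is still needed; unique text only removes one possible source of reuse.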

@iSazonov
Collaborator Author

I ran the test script in an interactive session by manually copy-pasting it, and used the debugger to confirm that Compile(true)/Compile(false) is called after each copy-paste.
I tried to run the script in a loop and got different results (many times faster). I guess a cache was involved there.
I am also surprised by these results. So far I'm inclined not to trust myself. Maybe I'm doing something wrong.

@daxian-dbw
Member

daxian-dbw commented Sep 10, 2018

The measurement is too specific and doesn't reflect real scenarios, here are the reasons:

  1. The testing script contains minimal statements, so the cost for JIT compiling is low and it doesn't reflect the cost you would get for a real scenario script with relatively many statements.

  2. When there are too many statements in a script block, it would be too expensive to JIT compile it (causing very slow startup). The tradeoff PowerShell takes is to NeverCompile the script block in that case. With the current CompileOnDemand policy, even though the script block as a whole will never be compiled, the loops in it will still be JIT compiled after running a certain number of times, as long as they don't contain too many statements (> 300). However, with your change, nothing will be JIT compiled in that case, and the performance will very likely decrease compared to the current CompileOnDemand.

And BTW, I guess tiered compilation was turned on in your local builds when compared with PS 6.1-RC. That would be another factor that changes the numbers you get from the measurements, even though I don't know how much difference that would make.

@iSazonov
Collaborator Author

iSazonov commented Sep 11, 2018

Yes, tiered compilation was turned on. Now I tested without it and see a ~20% decrease in performance in the scenario.

My main concern was that the interpreter would be much faster in the interactive session, but this fear did not materialize. It seems we could compile even small scripts in an interactive session. We can even get some benefit, because even a one-line script can be a CPU-expensive loop, and compilation plus tiered compilation seems to bring improvements.

The next test I ran was for the parser.

# Read every test script once; each array element holds one file's full text.
$text = dir -Recurse -Path .\test\powershell\ -Filter "*.ps1" | Get-Content -Raw


Invoke-Command  -ScriptBlock { for ($j = 0; $j -lt 100; $j++) {

$step3 = measure-command {
foreach ($t in $text) {
    $tokens = $null
    $errors = $null
    [Management.Automation.Language.Parser]::ParseInput($t, [ref]$tokens, [ref]$errors) | Out-Null
}
}
$step3.TotalMilliseconds

}} | Measure-Object -Average

Results for compile (in TC build) - 659 ms for RC1 and 662 ms for TC build - slower ~0.5%.
Results for interpreter - 673 ms for TC build - slower ~2.0%.

(I should note that the variance of the TC build is greater in all tests.)
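
Since run-to-run variance matters for these comparisons, a small helper that reports the mean together with the sample standard deviation can make the numbers easier to compare. This is a sketch; `Measure-Repeated` is a hypothetical function and not part of the test scripts above.

```powershell
# Hypothetical helper: runs a script block several times and summarizes timings.
function Measure-Repeated {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Count = 20
    )
    # Collect one wall-clock sample (in ms) per run.
    $samples = foreach ($n in 1..$Count) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    $mean = ($samples | Measure-Object -Average).Average

    # Sample standard deviation (n - 1 in the denominator).
    $sumSq = 0.0
    foreach ($s in $samples) { $sumSq += [math]::Pow($s - $mean, 2) }
    $stdDev = if ($Count -gt 1) { [math]::Sqrt($sumSq / ($Count - 1)) } else { 0.0 }

    [pscustomobject]@{ Runs = $Count; MeanMs = $mean; StdDevMs = $stdDev }
}

# Usage: summarize 10 runs of a small loop.
$summary = Measure-Repeated -ScriptBlock { for ($i = 0; $i -lt 300; $i++) { $a += 1 } } -Count 10
$summary
```

A difference between two builds that is smaller than the reported standard deviation is probably not meaningful without more runs.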

In this scenario TC does not seem to give any advantage, but it's more likely that the parser is high-quality code and crossgen is very good too.

Also this test shows that the following test for slow start (@daxian-dbw's point 2) will show the difference only for compilation or interpretation, because parsing will consume the same time.

@iSazonov
Collaborator Author

iSazonov commented Sep 11, 2018

Test for small script:

[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $false)

$step4a = measure-command {

for ($j1 = 0l; $j1 -lt 30000; $j1++) {

Invoke-Command  -ScriptBlock {

$no = $false
if ($no) {
    # 1
    $a += $a + 1
    $b -= $b - 1
    $c *= $c * 1Mb * 1Kb
    $d /= $d / 1Tb
    $e = "a,b,c,d" -split ","
    $f = "a", "b", "c", "d" -join ";"
    $g = New-Guid
    $i = [Math]::Max(1234567890, 12345678901234567890)
    $j = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)[9]
    $k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 | Measure-Object
}

} # end Invoke-Command

} # end for

} # end measure-command


[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $true)

$step4b = measure-command {

for ($j1 = 0l; $j1 -lt 30000; $j1++) {

Invoke-Command  -ScriptBlock {

$no = $false
if ($no) {
    # 1
    $a += $a + 1
    $b -= $b - 1
    $c *= $c * 1Mb * 1Kb
    $d /= $d / 1Tb
    $e = "a,b,c,d" -split ","
    $f = "a", "b", "c", "d" -join ";"
    $g = New-Guid
    $i = [Math]::Max(1234567890, 12345678901234567890)
    $j = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)[9]
    $k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 | Measure-Object
}

} # end Invoke-Command

} # end for

} # end measure-command


$step4a.TotalMilliseconds
$step4b.TotalMilliseconds

Results (times in ms):
(Updated because of a bug in the test script.)

| Iterations | RC1 | Crossgened Interpreter | Crossgened Compile |
|---|---|---|---|
| 30 | 11 | 11 | 24 |
| 300 | 32 | 34 | 36 |
| 3000 | 199 | 252 | 190 |
| 30000 | 1863 | 1928 | 1310 |

This result shows that even on RC1 we possibly do not need to compile small scripts. Perhaps this is due to a change between .NET Core and .NET Framework.

@iSazonov
Collaborator Author

Another test:

[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $false)
$step1 = measure-command {
for ($ii = 0l; $ii -lt 300; $ii++) {
    $a+=1
    $a += $a + 1
    $b -= $b - 1
    $c *= $c * 1Mb * 1Kb
    $d = 1;$d /= $d / 1Tb
    $e = "a,b,c,d" -split ","
    $f = "a", "b", "c", "d" -join ";"
    $g = New-Guid
    $i = [Math]::Max(123, 1234)
    $j = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)[9]
    $k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 | Measure-Object
}
}


[System.Management.Automation.Internal.InternalTestHooks]::SetTestHook("ExpressionCompile", $true)
$step2 = measure-command {
for ($ii = 0l; $ii -lt 300; $ii++) {
    $a+=1
    $a += $a + 1
    $b -= $b - 1
    $c *= $c * 1Mb * 1Kb
    $d = 1;$d /= $d / 1Tb
    $e = "a,b,c,d" -split ","
    $f = "a", "b", "c", "d" -join ";"
    $g = New-Guid
    $i = [Math]::Max(123, 1234)
    $j = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)[9]
    $k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 | Measure-Object
}
}

$step1.TotalMilliseconds
$step2.TotalMilliseconds

Results:

| Iterations | RC1 | Crossgened Interpreter | Crossgened Compile |
|---|---|---|---|
| 30 | 11 | 23 | 19 |
| 300 | 50 | 78 | 53 |
| 3000 | 386 | 594 | 411 |
| 5000 | 634 | 978 | 670 |
| 10000 | 1256 | 1923 | 1299 |
| 20000 | 2534 | 3684 | 2522 |
| 30000 | 3747 | 5560 | 3819 |

@iSazonov
Collaborator Author

I updated the previous test results because of a bug in the test script.
Now I see that most likely neither the compiler nor the CoreFX interpreter will give an improvement.
On the other hand, TC does give an improvement in performance.

@iSazonov iSazonov changed the title from "Use new Compile() overload in CompileTree()" to "WIP: Use new Compile() overload in CompileTree()" Sep 12, 2018
@iSazonov
Collaborator Author

The .NET Core team announced .NET Core 2.2.0 Preview 2 with TC enabled by default.
https://blogs.msdn.microsoft.com/dotnet/2018/09/12/announcing-net-core-2-2-preview-2/

We should definitely continue to investigate the effect of TC on PowerShell Core.

@daxian-dbw
Member

In my naive measurement, there is a 7% startup time improvement with crossgen'ed pwsh + tiered compilation enabled.

@iSazonov
Collaborator Author

@daxian-dbw The blog article mentions a 20% PowerShell startup improvement. Could you directly contact the people who ran the test? Perhaps they have more PowerShell performance tests and could give advice on how to fine-tune TC for PowerShell.

@daxian-dbw
Member

daxian-dbw commented Sep 18, 2018

@iSazonov I asked for details of the measurement, and it turned out the measurement was made with PSCore 6.0 code base built with release configuration without crossgen. PowerShell was used as a Fx Dependent application in the measurement and executed by dotnet .\bin\release\netcoreapp2.0\win7-x64\pwsh.dll -command exit.

So, the result of the measurements indicates that with a snapshot of the code base at 6.0 timeframe, the perf improvement in .NET Core 2.1 runtime and tiered compilation combined together offer a 20% startup improvement in the Fx Dependent scenario.

However, we got some degradation in startup time in the 6.1 timeframe, and the rough sources include TaskbarJumpList, Experimental Feature Flag (configuration file access) and more. From my naive measurement, without tiered compilation, 6.1 is about 7% slower than 6.0 in startup time (crossgen'ed), and after turning on tiered compilation, 6.1 is about the same as 6.0. I'm looking into the degradation.
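
A rough way to repeat this kind of startup comparison locally is sketched below. It assumes the `COMPlus_TieredCompilation` environment variable described in the tiered compilation preview blog post controls the feature on the runtime in use; `Compare-TieredStartup` is a hypothetical helper, and the launch command is supplied by the caller.

```powershell
# Hypothetical helper: times a launch command with tiered compilation off and on.
# Assumption: the runtime honors the COMPlus_TieredCompilation environment variable.
function Compare-TieredStartup {
    param(
        [Parameter(Mandatory)] [scriptblock] $Launch,  # starts the host and exits
        [int] $Count = 10
    )
    $previous = $env:COMPlus_TieredCompilation
    try {
        foreach ($setting in '0', '1') {
            # Child processes started by $Launch inherit this variable.
            $env:COMPlus_TieredCompilation = $setting
            $samples = foreach ($n in 1..$Count) {
                (Measure-Command -Expression $Launch).TotalMilliseconds
            }
            [pscustomobject]@{
                TieredCompilation = $setting
                AverageMs         = ($samples | Measure-Object -Average).Average
            }
        }
    }
    finally {
        # Restore the caller's environment (assigning $null removes the variable).
        $env:COMPlus_TieredCompilation = $previous
    }
}

# Usage, with the Fx Dependent command line quoted above:
# Compare-TieredStartup -Launch { dotnet .\bin\release\netcoreapp2.0\win7-x64\pwsh.dll -command exit }
```

Averages over a handful of launches are noisy, so a few percent of difference should be confirmed with more runs on an otherwise idle machine.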

@iSazonov
Collaborator Author

@daxian-dbw Thanks! It is interesting!

TaskbarJumpList

I think we could move this to the install phase (as an MSI custom action).

@stale

stale bot commented Oct 19, 2018

This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days.
Thank you for your contributions.
Community members are welcome to grab these works.

@stale stale bot added the Stale label Oct 19, 2018
@iSazonov
Collaborator Author

Does it make sense to ask the CoreFX team to implement LightCompiler(threshold) (3-tiered compilation)?

@stale stale bot removed the Stale label Oct 23, 2018
@lzybkr
Contributor

lzybkr commented Oct 23, 2018

You can certainly ask, though it might be enough to add support for custom instructions in the interpreter.

Today, PowerShell loops are implemented as a custom instruction which switches to the jit compiled version of the loop after sufficient iterations. Our core interpreter works similarly, but PowerShell loops were just special enough that it was a little easier to create a new instruction.

If CoreFX allowed this sort of extension, both loops and entire functions could implement their own tiered compilation strategy.

@stale

stale bot commented Nov 22, 2018

This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days.
Thank you for your contributions.
Community members are welcome to grab these works.

@stale stale bot added the Stale label Nov 22, 2018
@iSazonov iSazonov closed this Nov 29, 2018