Address UTF-8 Detection In Get-Content -Tail #11899

NoMoreFood · 2020-02-20T01:03:29Z

PR Summary

Fix #11830
Addresses a comparison failure that causes UTF-8 detection to fail, which in turn causes Get-Content -Tail to fall back to forward reads because the encoding type cannot be detected. The misdetection is possibly due to the incoming encoding object being of type System.Text.UTF8Encoding, whereas the comparison uses the object Encoding.UTF8, which is an instance of the derived type System.Text.UTF8Encoding+UTF8EncodingSealed.
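As an illustration only (this is not the code changed in this PR), the PowerShell sketch below shows how an Equals check against Encoding.UTF8 can be compared with a check on the code page. The variable names are hypothetical, and constructing the encoding with UTF8Encoding::new($false) is an assumption standing in for whatever encoding object the provider actually receives.

```powershell
# Hypothetical stand-in for the encoding object handed to the content reader.
$incoming  = [System.Text.UTF8Encoding]::new($false)   # UTF-8 without a BOM
$reference = [System.Text.Encoding]::UTF8

# Runtime types of the two objects (they need not be the same class).
$incoming.GetType().FullName
$reference.GetType().FullName

# Object-level comparison, in the style of _currentEncoding.Equals(Encoding.UTF8).
$equalsResult = $incoming.Equals($reference)

# Comparison by code page, which ignores BOM and fallback settings.
$sameCodePage = $incoming.CodePage -eq $reference.CodePage

"Equals: $equalsResult; same code page: $sameCodePage"
```

Both objects report code page 65001, so a code-page or type-based check treats them as UTF-8 regardless of what Equals returns. Whether the actual fix uses either of these checks is not stated in this thread.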

PR Context

The problem was discovered while investigating general performance issues with Get-Content -Tail. It was narrowed down to the fact that -Tail reads the entire file when Get-Content is used on a UTF-8 file.

PR Checklist

- Addresses a comparison failure that causes UTF-8 detection to fail, which in turn causes Get-Content -Tail to fall back to forward reads because the encoding type cannot be detected. The misdetection is possibly due to the incoming encoding object being of type System.Text.UTF8Encoding, whereas the comparison uses the object Encoding.UTF8, which is an instance of the derived type System.Text.UTF8Encoding+UTF8EncodingSealed. - See PowerShell#11830

iSazonov · 2020-02-20T15:00:29Z

@NoMoreFood Please add a test.

src/System.Management.Automation/namespaces/FileSystemContentStream.cs

NoMoreFood · 2020-02-20T21:46:39Z

@iSazonov Since this addresses an internal performance issue, did you just want a textual output posted here with before/after performance tests for a variety of text encodings?

iSazonov · 2020-02-21T03:18:32Z

@NoMoreFood I have not looked at the issue in depth, but I feel that if UTF-8 detection is wrong, the cmdlet can return wrong results. In that case we should add tests.

NoMoreFood · 2020-02-21T06:10:07Z

@iSazonov Alright, I'll look into creating a general, functional test. I will say, though, that this was not an issue on Windows PowerShell. Notably, Windows PowerShell was not failing the _currentEncoding.Equals(Encoding.UTF8) comparison that resulted in this problem.

vexx32 · 2020-02-21T06:11:26Z

I imagine there may have been a change in the .NET Core API at some point that resulted in that failing. It seems on the surface like a relatively innocuous change until you start doing reference equality comparisons against it for things like this. 😁

iSazonov · 2020-02-21T06:14:06Z

@NoMoreFood Please check that your new tests fail on the current version and do not fail after your fix. Thanks!

- Added 'OEM', 'UTF8BOM', and 'UTF8NoBOM' as explicit encodings for existing Get-Content -Tail tests.

NoMoreFood · 2020-02-21T08:50:38Z

@iSazonov It appears there are already functional tests for various encodings within the test suite. However, I added a few additional explicit encodings that were not in the enumeration and committed that change to this pull. Given this pull addresses a performance issue, I do not see how a reliable pass/fail test can be written given performance can vary drastically between systems and, to a lesser extent, between executions on the same system given other operating system activity. The only reasonable way to provide a test for something like this would be to add debug/tracing code to the code base to verify the desired code branch is hit/not hit; I do not see any precedent for this type of testing in other tests.

In absence of that, I have provided a before and after demonstration that clearly shows the performance change on my system.

Test Code:

$Encodings = @('String','OEM','Unicode','BigEndianUnicode',
    'UTF8','UTF8BOM','UTF8NoBOM','UTF7','UTF32','Ascii')
$TempFile = (New-TemporaryFile).FullName
$Results = @()
ForEach ($Encoding in $Encodings)
{
    (1..2e6) | Set-Content -Encoding $Encoding -LiteralPath $TempFile -Force
    $Time = Measure-Command { $Capture = Get-Content -Tail 1 -LiteralPath $TempFile }
    $Results += New-Object PSObject -Property @{'Encoding'=$Encoding;'Time'=$Time.TotalMilliSeconds}
    Remove-Item -LiteralPath $TempFile -Force
}
$Results | Format-Table -AutoSize

Before Changes (time in milliseconds):

Encoding             Time
--------             ----
String            12.3904
OEM              585.5496
Unicode            1.2809
BigEndianUnicode   1.1258
UTF8             555.5736
UTF8BOM            0.9639
UTF8NoBOM        647.7248
UTF7             560.4215
UTF32              1.1847
Ascii            562.1341

After Changes (time in milliseconds):

Encoding           Time
--------           ----
String           2.8139
OEM              0.8078
Unicode          0.8523
BigEndianUnicode 0.9938
UTF8             0.7483
UTF8BOM          0.7559
UTF8NoBOM        0.7721
UTF7             0.8044
UTF32            1.1415
Ascii            0.9463

Notice the difference in the timings for the encodings that will be detected as UTF-8 (given the character set used in the demonstration). The "after" results are similar to what you would see with Windows PowerShell.

iSazonov · 2020-02-21T09:25:57Z

@NoMoreFood I guess the updated test does not fail on the current version. Can you check? If so, I suggest replacing the test text (it is really single-byte ASCII)

        $content = @"
one
two
foo
bar
baz
"@

with text containing multi-byte Unicode characters, e.g.:

        $content = @"
один
два
фуу
бар
база
"@

NoMoreFood · 2020-02-21T12:13:16Z

@iSazonov Other parts of that test case will need to change as well to support multi-byte Unicode validation. I'll beef it up later today and submit an update. On the surface, after a quick mod, it does not appear to behave any differently before or after these changes (except for the fact that it's faster). More to come...

- Modified -Tail encoding test to use three different test sets: utf-8, utf-16, utf-32. The test verifies that the content resulting from -Tail is equal to the same string returned from a regular Get-Content using both an explicit and implicit encoding.
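A minimal sketch of the kind of check described above, not the test actually added in this PR. The sample strings, the variable names, and the choice of -Tail 2 are assumptions made for illustration.

```powershell
# Hypothetical sample data containing multi-byte characters.
$lines = 'один', 'два', 'фуу', 'бар', 'база'
$tempFile = (New-TemporaryFile).FullName
$mismatches = @()

foreach ($encoding in 'UTF8', 'Unicode', 'UTF32') {
    $lines | Set-Content -Encoding $encoding -LiteralPath $tempFile -Force

    # Reference result: read the whole file, then keep the last two lines.
    $expected = @(Get-Content -Encoding $encoding -LiteralPath $tempFile) |
        Select-Object -Last 2

    # -Tail with an explicit encoding and with an implicit (detected) one.
    $explicit = @(Get-Content -Tail 2 -Encoding $encoding -LiteralPath $tempFile)
    $implicit = @(Get-Content -Tail 2 -LiteralPath $tempFile)

    if (($explicit -join "`n") -ne ($expected -join "`n")) { $mismatches += "$encoding (explicit)" }
    if (($implicit -join "`n") -ne ($expected -join "`n")) { $mismatches += "$encoding (implicit)" }
}

Remove-Item -LiteralPath $tempFile -Force

# Lists every encoding for which -Tail disagreed with a full read.
$mismatches
```

This only checks correctness of the returned lines. It says nothing about whether the backward-reading code path was taken, which is the limitation raised earlier in this thread.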

NoMoreFood · 2020-02-21T13:09:28Z

@iSazonov Enhanced Get-Content / Get-Content -Tail tests have been added.

daxian-dbw · 2020-03-13T06:39:21Z

@NoMoreFood Is there an issue filed for this fix? If not, can you please open an issue that describes the problem with repro steps?

NoMoreFood · 2020-03-13T08:16:17Z

@daxian-dbw Yes, it's in the PR description: #11830

daxian-dbw · 2020-03-13T16:15:38Z

Oh, I see. Maybe that part of the template is not clear enough :) That's for the doc issue if a documentation change is needed

The issue which the PR is addressing should be put in the PR description, like Fix #xxxx. I added it to your PR description.

src/System.Management.Automation/namespaces/FileSystemContentStream.cs

daxian-dbw

LGTM except for one comment.
Thanks for your contribution!

iSazonov · 2020-03-14T04:57:21Z

@NoMoreFood Thanks for your contribution!

ghost · 2020-03-26T17:48:21Z

🎉 v7.1.0-preview.1 has been released which incorporates this pull request. 🎉

Handy links:

Release Notes

NoMoreFood requested a review from anmenaga as a code owner February 20, 2020 01:03

ghost assigned adityapatwardhan Feb 20, 2020

vexx32 reviewed Feb 20, 2020

Add Additional Explicit Encodings For Get-Content -Tail Tests

868a814

- Added 'OEM', 'UTF8BOM', and 'UTF8NoBOM' as explicit encodings for existing Get-Content -Tail tests.

NoMoreFood force-pushed the encoding_detection branch from db266b2 to 868a814 Compare February 21, 2020 08:22

iSazonov added the CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log label Feb 21, 2020

iSazonov added this to the 7.1.0-preview.1 milestone Feb 21, 2020

iSazonov approved these changes Feb 21, 2020

daxian-dbw reviewed Mar 13, 2020

daxian-dbw approved these changes Mar 13, 2020

Remove BigEndianUnicode Reference In Comment

6c62890

iSazonov assigned iSazonov and unassigned adityapatwardhan Mar 14, 2020

iSazonov merged commit 07962a9 into PowerShell:master Mar 14, 2020

NoMoreFood deleted the encoding_detection branch March 16, 2020 03:28

ghost mentioned this pull request Mar 26, 2020

Get-Content -ReadCount 0 combined with -Last / -Tail seems to read ALL lines internally #11830

Closed

iSazonov mentioned this pull request Nov 12, 2020

Get-Content -Tail parameter is broken in v7.1.0-rc.2 and v7.1.0 #14028

Closed
