Skip to content

Conversation

@NoMoreFood
Copy link
Contributor

@NoMoreFood NoMoreFood commented Feb 20, 2020

PR Summary

Fix #11830
Addresses a comparison failure that causes UTF-8 detection to fail which in turn causes Get-Content -Tail to resort to forward lookups given encoding type cannot be detected. Possible this misdetection is due to the incoming encoding object as being of type System.Text.UTF8Encoding where as the comparison uses the object Encoding.UTF8 which is derived from System.Text.UTF8Encoding+UTF8EncodingSealed.

PR Context

Problem was discovered when investigating performance issues for Get-Content -Tail, in general. The problem was narrowed to the fact that -Tail will read the entire file when using Get-Content on a UTF-8 file.

PR Checklist

- Addresses a comparison failure that causes UTF-8 detection to fail which in turn causes Get-Content -Tail to resort to forward lookups given encoding type cannot be detected. Possible this misdetection is due to the incoming encoding object as being of type System.Text.UTF8Encoding where as the comparison uses the object Encoding.UTF8 which is derived from System.Text.UTF8Encoding+UTF8EncodingSealed.
- See PowerShell#11830
@iSazonov
Copy link
Collaborator

@NoMoreFood Please add a test.

@NoMoreFood
Copy link
Contributor Author

@iSazonov Since this addresses an internal performance issue, did you just want a textual output posted here with before/after performance tests for a variety of text encodings?

@iSazonov
Copy link
Collaborator

@NoMoreFood I do not look the issue in depths but I feel that if UTF-8 detection wrong the cmdlet can return wrong results. In the case we should add tests.

@NoMoreFood
Copy link
Contributor Author

@iSazonov Alright, I'll take a look into creating a general, functional test. I will say though that this was not an issue on Windows PowerShell. And notably, Windows PowerShell was not failing the _currentEncoding.Equals(Encoding.UTF8) comparison that resulted in this problem.

@vexx32
Copy link
Collaborator

vexx32 commented Feb 21, 2020

I imagine there may have been a change in the .NET Core API at some point that resulted in that failing. It seems on the surface like a relatively innocuous change until you start doing reference equality comparisons against it for things like this. 😁

@iSazonov
Copy link
Collaborator

@NoMoreFood Please check that your new tests fail on current version and do not fail after your fix. Thanks!

- Added 'OEM', 'UTF8BOM', and 'UTF8NoBOM' as explicit encodings for existing Get-Content -Tail tests.
@NoMoreFood
Copy link
Contributor Author

@iSazonov It appears there are already functional tests for various encodings within the test suite. However, I added a few additional explicit encodings that were not in the enumeration and committed that change to this pull. Given this pull addresses a performance issue, I do not see how a reliable pass/fail test can be written given performance can vary drastically between systems and, to a lesser extent, between executions on the same system given other operating system activity. The only reasonable way to provide a test for something like this would be to add debug/tracing code to the code base to verify the desired code branch is hit/not hit; I do not see any precedent for this type of testing in other tests.

In absence of that, I have provided a before and after demonstration that clearly shows the performance change on my system.

Test Code:

$Encodings = @('String','OEM','Unicode','BigEndianUnicode',
    'UTF8','UTF8BOM','UTF8NoBOM','UTF7','UTF32','Ascii')
$TempFile = (New-TemporaryFile).FullName
$Results = @()
ForEach ($Encoding in $Encodings)
{
    (1..2e6) | Set-Content -Encoding $Encoding -LiteralPath $TempFile -Force
    $Time = Measure-Command { $Capture = Get-Content -Tail 1 -LiteralPath $TempFile }
    $Results += New-Object PSObject -Property @{'Encoding'=$Encoding;'Time'=$Time.TotalMilliSeconds}
    Remove-Item -LiteralPath $TempFile -Force
}
$Results | Format-Table -AutoSize

Before Changes:

Encoding             Time
--------             ----
String            12.3904
OEM              585.5496
Unicode            1.2809
BigEndianUnicode   1.1258
UTF8             555.5736
UTF8BOM            0.9639
UTF8NoBOM        647.7248
UTF7             560.4215
UTF32              1.1847
Ascii            562.1341

After Changes:

Encoding           Time
--------           ----
String           2.8139
OEM              0.8078
Unicode          0.8523
BigEndianUnicode 0.9938
UTF8             0.7483
UTF8BOM          0.7559
UTF8NoBOM        0.7721
UTF7             0.8044
UTF32            1.1415
Ascii            0.9463

Notice the difference the timings for the code that will be detected as UTF-8 (given the current character set used in the demonstration). The "after" results are similar to what you would see with Windows PowerShell.

@iSazonov
Copy link
Collaborator

@NoMoreFood I guess the updated test does not fail on current version. Can you check? If so I suggest replace the test text (really it is one byte ASCII)

        $content = @"
one
two
foo
bar
baz
"@

with a text having multi byte Unicode, ex.:

        $content = @"
один
два
фуу
бар
база
"@

@NoMoreFood
Copy link
Contributor Author

@iSazonov Other parts of that test case will need to change as well to support multi-byte Unicode validation. It'll beef it up later today and submit an update. On the surface after a quick mod, it does not appear to behave any differently before or after these changes (except for the fact it's faster). More to come...

@iSazonov iSazonov added the CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log label Feb 21, 2020
@iSazonov iSazonov added this to the 7.1.0-preview.1 milestone Feb 21, 2020
- Modified -Tail encoding test to use three different test sets: utf-8, utf-16, utf-32.  The test verifies that the content resulting from -Tail is equal to the same string returned from a regular Get-Content using both an explicit and implicit encoding.
@NoMoreFood
Copy link
Contributor Author

@iSazonov Enhanced Get-Content / Get-Content -Tail tests have been added.

@daxian-dbw
Copy link
Member

@NoMoreFood Is there an issue filed for this fix? If not, can you please open an issue that describes the problem with repro steps?

@NoMoreFood
Copy link
Contributor Author

@daxian-dbw Yes, it's in the PR description: #11830

@daxian-dbw
Copy link
Member

Oh, I see. Maybe that part of the template is not clear enough :) That's for the doc issue if a documentation change is needed
image


The issue which the PR is addressing should be put in the PR description, like Fix #xxxx. I added it to your PR description.

Copy link
Member

@daxian-dbw daxian-dbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one comment.
Thanks for your contribution!

@iSazonov iSazonov merged commit 07962a9 into PowerShell:master Mar 14, 2020
@iSazonov
Copy link
Collaborator

@NoMoreFood Thanks for your contribution!

@NoMoreFood NoMoreFood deleted the encoding_detection branch March 16, 2020 03:28
@ghost
Copy link

ghost commented Mar 26, 2020

🎉v7.1.0-preview.1 has been released which incorporates this pull request.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Get-Content -ReadCount 0 combined with -Last / -Tail seems to read ALL lines internally

5 participants