-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Address UTF-8 Detection In Get-Content -Tail #11899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Addresses a comparison failure that causes UTF-8 detection to fail which in turn causes Get-Content -Tail to resort to forward lookups given encoding type cannot be detected. Possible this misdetection is due to the incoming encoding object as being of type System.Text.UTF8Encoding where as the comparison uses the object Encoding.UTF8 which is derived from System.Text.UTF8Encoding+UTF8EncodingSealed. - See PowerShell#11830
|
@NoMoreFood Please add a test. |
|
@iSazonov Since this addresses an internal performance issue, did you just want a textual output posted here with before/after performance tests for a variety of text encodings? |
|
@NoMoreFood I do not look the issue in depths but I feel that if UTF-8 detection wrong the cmdlet can return wrong results. In the case we should add tests. |
|
@iSazonov Alright, I'll take a look into creating a general, functional test. I will say though that this was not an issue on Windows PowerShell. And notably, Windows PowerShell was not failing the _currentEncoding.Equals(Encoding.UTF8) comparison that resulted in this problem. |
|
I imagine there may have been a change in the .NET Core API at some point that resulted in that failing. It seems on the surface like a relatively innocuous change until you start doing reference equality comparisons against it for things like this. 😁 |
|
@NoMoreFood Please check that your new tests fail on current version and do not fail after your fix. Thanks! |
- Added 'OEM', 'UTF8BOM', and 'UTF8NoBOM' as explicit encodings for existing Get-Content -Tail tests.
db266b2 to
868a814
Compare
|
@iSazonov It appears there are already functional tests for various encodings within the test suite. However, I added a few additional explicit encodings that were not in the enumeration and committed that change to this pull. Given this pull addresses a performance issue, I do not see how a reliable pass/fail test can be written given performance can vary drastically between systems and, to a lesser extent, between executions on the same system given other operating system activity. The only reasonable way to provide a test for something like this would be to add debug/tracing code to the code base to verify the desired code branch is hit/not hit; I do not see any precedent for this type of testing in other tests. In absence of that, I have provided a before and after demonstration that clearly shows the performance change on my system. Test Code: Before Changes: After Changes: Notice the difference the timings for the code that will be detected as UTF-8 (given the current character set used in the demonstration). The "after" results are similar to what you would see with Windows PowerShell. |
|
@NoMoreFood I guess the updated test does not fail on current version. Can you check? If so I suggest replace the test text (really it is one byte ASCII) $content = @"
one
two
foo
bar
baz
"@with a text having multi byte Unicode, ex.: $content = @"
один
два
фуу
бар
база
"@ |
|
@iSazonov Other parts of that test case will need to change as well to support multi-byte Unicode validation. It'll beef it up later today and submit an update. On the surface after a quick mod, it does not appear to behave any differently before or after these changes (except for the fact it's faster). More to come... |
- Modified -Tail encoding test to use three different test sets: utf-8, utf-16, utf-32. The test verifies that the content resulting from -Tail is equal to the same string returned from a regular Get-Content using both an explicit and implicit encoding.
|
@iSazonov Enhanced Get-Content / Get-Content -Tail tests have been added. |
|
@NoMoreFood Is there an issue filed for this fix? If not, can you please open an issue that describes the problem with repro steps? |
|
@daxian-dbw Yes, it's in the PR description: #11830 |
daxian-dbw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one comment.
Thanks for your contribution!
|
@NoMoreFood Thanks for your contribution! |
|
🎉 Handy links: |

PR Summary
Fix #11830
Addresses a comparison failure that causes UTF-8 detection to fail which in turn causes Get-Content -Tail to resort to forward lookups given encoding type cannot be detected. Possible this misdetection is due to the incoming encoding object as being of type System.Text.UTF8Encoding where as the comparison uses the object Encoding.UTF8 which is derived from System.Text.UTF8Encoding+UTF8EncodingSealed.
PR Context
Problem was discovered when investigating performance issues for Get-Content -Tail, in general. The problem was narrowed to the fact that -Tail will read the entire file when using Get-Content on a UTF-8 file.
PR Checklist
.h,.cpp,.cs,.ps1and.psm1files have the correct copyright headerWIP:or[ WIP ]to the beginning of the title (theWIPbot will keep its status check atPendingwhile the prefix is present) and remove the prefix when the PR is ready.