Skip to content

Conversation

@JamesWTruher
Copy link
Collaborator

Implementation of https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md

Cmdlets now create files with consistent encoding (UTF8 without BOM) on all platforms.
A new preference variable PSDefaultFileEncoding is now available to enable users to set encoding for cmdlets. By setting $PSDefaultFileEncoding = "WindowsLegacy" users can select the encoding which exists in PowerShell5 for each specific cmdlet.

Provider and cmdlet encoding methods have been centralized and are now common.

@JamesWTruher JamesWTruher added WG-Cmdlets general cmdlet issues WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module Breaking-Change breaking change that may affect users labels Jun 27, 2017
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use static cache for UTF8Encoding(false); here and in Line 142 too - the cache is in Line 311.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@lzybkr
Copy link
Contributor

lzybkr commented Jun 29, 2017

Please fix the conflict and rebase.

@JamesWTruher JamesWTruher force-pushed the jameswtruher/encoding3 branch 2 times, most recently from a013795 to 9f77970 Compare June 29, 2017 22:44
@iSazonov
Copy link
Collaborator

If the new test batch covers all the old encoding variants, it would be good to merge the tests before the PR - then we'll be sure to see that the PR has full backward compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra empty line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use legacyEncodingMap.TryGetValue instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to care about WindowsLegacyEncoding here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - because of the provider cmdlets which leave the encoding to the underlying provider as a dynamic parameter it needs to be handled via a different path than checking against the name of the cmdlet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the commented block?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete commented block

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed - I missed that one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the code coverage with the newly added tests for this file? So we add tests now if we are missing any.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that we can, i will :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't you using a ValidateSet with the FileEncoding values?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use #3784 - dynamic validate set, don't expose FileEncoding and allow enhancement to use legacy charsets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to use validateset here at all. Especially since this is so pervasive in so many areas, I get the same more easily with an enum

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, Why aren't you using a ValidateSet with the FileEncoding values

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be FileEncoding.UTF8. See BeginProcessing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i want to be sure that I discover the encoding. If I set it to something that is likely to be provided, I won't be able to determine whether the user explicitly set it. By setting it to unknown, I'll calculate the appropriate value and I can tell that it wasn't set by the user

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest checking for encoding != FileEncoding.Unknown first; none of the other code paths are executed for that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use initialBytes.Length instead of 100.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need to read 100 bytes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was part of a refactor where this code has been around for a long time, and I didn't really want to change it. I'm certainly open reducing the size, but it's not too much, not too little and any number here would be a guess anyway, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we look beyond the preamble - a page size seems fine because we'll read that much from disk anyway, so 100 is totally fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest naming this GetFileEncoding for clarity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably throw; not return Default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily - the file could be later created, so I need to pass back something - this code was refactored from another location, so i wanted to do as little as possible

Copy link
Contributor

@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the string building/lookups are overkill. Consider testing the bytes directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was code which was refactored from another location, I didn't really want to change it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do fix it, or open an issue to get it fixed. Testing bytes directly is much better - no extra allocations, no static initialization. It's simple code too - here is my PowerShell version: https://gist.github.com/lzybkr/f040f18d1fbfff9eaf3f4533e24126fe#file-fix_trailing_ws-ps1-L14 - note that my version handles UFT7 which this version does not, but mine misses the big endian encodings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be the only external call to Encoding overload of PathUtils.MasterStreamOpen. Suggest using FileEncoding.Ascii here and merge the two MasterStreamOpen overloads.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah - I was worried about removing the overload as it was a public interface and thought to do less harm, just in case someone else was using it somehow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MasterStreamOpen is internal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll investigate

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the encoding overload is used by CSV cmdlets which do their own determination of the encoding (in the append case), and pass that along to the open. I'd rather not refactor all that code if I don't need to.

Copy link
Contributor

@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can find no references to this field other than setting the session state variable. How is it used and how does it control behavior? Currently, it appears to be a NOP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used down in line 4915 where it essentially lays claim to the variable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually needed? The only other change to this file is the removal of the GetEncoding logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CORECLR branches could use commenting in both GetDefaultEncoding and GetOEMEncoding. The logic reasoning for GetACP and GetOEMCP is opaque.

Copy link
Collaborator

@iSazonov iSazonov Jul 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion was in #3248 and fixed in #3467. Maybe it should be reviewed again. In short we returned behavior Windows PowerShell and exclude the breaking change.

Copy link
Collaborator Author

@JamesWTruher JamesWTruher Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add comments about this, we can change it later after we have live-fire feedback.

Copy link
Contributor

@dantraMSFT dantraMSFT Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest merging this with the Encoding overload. There appears to be only 1 caller to the later overload and it does so by converting a FileEncoding to an Encoding to accomplish it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using a dictionary for these cascading 'if' tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not referenced.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@SteveL-MSFT
Copy link
Member

@PowerShell/powershell-committee reviewed this and we should have an RFC specific to the new public apis (and generally any new public apis should be RFCs going forward)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be GetEncodingFromFile to keep the naming consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Unspecified, not return unknown

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this api would be better if we passed a SessionState instead of Cmdlet.

Create new class PowerShellEncoding and enum FileEncoding to unify cmdlet and provider code for file encoding.
Created PowerShellEncoding class and FileEncoding enum and removed ClrFacade.GetDefaultEncoding.
PSDefaultFileEncoding preference variable now can set file encoding across all cmdlets.
Setting PSDefaultFileEncoding to WindowsLegacy will set file encoding to historic PowerShell5 encodings.
some tests were failing on Windows because new line is different, calculate the bytes
in newline rather than hardcoding them
ClrFacade retains some of its functionality, but now relies on PowerShellEncoding class for
knowing what the default coding is. The encoding methods which calls native methods is retained.
the WindowsLegacy behavior for New-ModuleManifest should get the correct number of
bytes which will change depending on how many bytes are encoded for [Environment]::NewLine
update calls which create had been creating a new instance of the Utf8 encoding without BOM
to return the available static
only use hardcoded bytes when it's a custom file generation
(like Export-CliXml or New-ModuleManifest)
or when we're looking at a set of partial results
also remove an unused function
…ing files

improve code coverage for PowerShellEncoding class and don't duplicate byte representations when they're not needed
Remove a couple of extraneous using statements
Move encoding code from PathUtils.cs to Encoding.cs
Move and rename GetEncoding method in Utils.cs to GetFileEncodingFromFile in Encoding.cs
Expand some explanatory comments with more details
…oser to what is really happening

Update RedirectionOperator tests to compare bytes in a more sensible manner
Remove PSDefaultFileEncoding from special variable collection so they don't show up in script cmdlets
miscellaneous code clean up
made file encoding probe method internal
{ "239-187-191", FileEncoding.UTF8BOM },
};

internal static char[] nonPrintableCharacters = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, 11 (vertical tab) and 12 (form feed) are also printable characters; given their rarity, I assume they were left out intentionally; if so, perhaps a comment would appease drive-by pedants like me.

#else // PowerShell Core on Windows, which needs provider registration
EncodingRegisterProvider();

uint oemCp = NativeMethods.GetOEMCP();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this line is no longer needed.

// The OEM code pages are sometimes used by Win32 console applications, and
// on non-Windows platforms they still may have uses (if installed) and
// could be used if desired.
// On non-windows platforms, they have more limited uses, and probably won't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is a bit confusing. Doesn't the previous one say everything that needs to be said?

{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII },
// Providers are handled here
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII },
};
Copy link
Contributor

@mklement0 mklement0 Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that doesn't appear to be correct.

As for the writing cmdlets:

Add-Content and Set-Content - despite what the documentation claims - have always used "ANSI" encoding, not ASCII - see MicrosoftDocs/PowerShell-Docs#1483.

What about Send-MailMessage? The help topics claim that ASCII encoding is the default.

New-Item -Type File -Value currently creates BOM-less(!) UTF-8.

As an aside: While Export-Csv without -Append indeed creates ASCII files, Export-Csv -Append currently blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/"ANSI", but correctly matches UTF-16LE and UTF-16BE.


As for the reading cmdlets:

Get-Content currently defaults to "ANSI" in the absence of a BOM, as does Import-PowerShellDataFile.

Import-Csv currently actually assumes UTF-8 in the absence of a BOM, as do Import-CliXml and Select-String

@JamesWTruher
Copy link
Collaborator Author

i am going to take a different route for this change - closing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Breaking-Change breaking change that may affect users Committee-Reviewed PS-Committee has reviewed this and made a decision WG-Cmdlets general cmdlet issues WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants