Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 #4761

mklement0 · 2017-09-06T18:45:35Z

Implements #4752, though not exactly as asked:
This PR does not replace the copyright/trademark symbols with their equivalent Unicode escape sequences, but instead - in line with BOM-less UTF-8 becoming the default - performs the following tasks:

all CRLF newlines were converted to LF-only newlines
all (by definition BOM-less) Windows-1252-encoded files were converted to BOM-less UTF-8.
all UTF-8 files with BOM had their BOM removed.
a few files contained invalid characters; they were replaced/removed.

In other words: all files are now BOM-less UTF-8 files with LF-only line endings, and, ideally, only such files should be added in the future.

msftclas · 2017-09-06T18:45:39Z

@mklement0,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

SteveL-MSFT · 2017-09-06T22:44:16Z

Please fix the test failures. Also, submit with [feature] as first line in commit description so the feature tests get run.

…zed, converted all file to BOM-less UTF-8

rkeithhill · 2017-09-06T23:52:55Z

I get the re-encoding work but why change the line-endings to LF? Why not let autocrlf continue to do its magic? If you do go through with this change, then you should update .gitattributes to enforce the EOL seq e.g * text eol=lf.

SteveL-MSFT · 2017-09-07T01:05:51Z

Based on discussion with @lzybkr we both agree that our files should just be ascii. Any characters outside of that range should be escaped appropriately. Also, agree with @rkeithhill that line endings shouldn't change as git will convert as necessary.

mklement0 · 2017-09-07T01:36:14Z

@SteveL-MSFT:

Re ASCII: I can do that, however:

I suggest you then somehow implement checks that prevent checking in files with non-ASCII characters.
Currently, the only instances of non-ASCII characters are in comments and informational text-only files - wouldn't it make more sense to show the actual copyright symbol than a u{...} escape sequence, for instance?

Again, going with the already agreed-upon BOM-less UTF-8 as the implied default encoding seems like the better choice to me.
Similarly, we then need a check that prevents checking in invalid-as-UTF-8 files (for ASCII-characters-only files, this will never be a concern).

Re newlines:

Of the 3053 text files currently in the repo, only 366 are currently CRLF; the remaining 2687 already are LF-only files.

What benefit is there in allowing this mix of newline styles?

As long as we make the tests aware of platform-typical checkout behavior, we can continue to let git's core.autocrlf do its to-and-fro CRLF-LF conversion, although I like @rkeithhill's idea of enforcing LF on checkout as part of the repo, which would simplify tests.

@rkeithhill: Is adding that one line to .gitattributes all that is needed?

SteveL-MSFT · 2017-09-07T01:40:23Z

@mklement0 for comments and markdown files, I think using (C) or (R) is sufficient, the escape sequence was specifically about code

I have noticed that sometimes after I push a change, GitHub doesn't display the diff correctly because of line endings (I usually end up resaving it as LF which seems to fix the problem so I suspect this is the issue). So if converting the rest to have consistency and potentially resolves this problem, I'm ok with having everything LF on the repo side.

SteveL-MSFT

File encoding should be ASCII. Ok with line endings being LF.

lzybkr · 2017-09-07T02:31:01Z

I'm not OK with LF - we should use "native" line endings where most of the developers are. Today and for the foreseeable future, that skews towards Windows.

If you are finding consistency issues though, I'm happy to see that fixed, just not moving to LF.

mklement0 · 2017-09-07T04:19:22Z

I'll list the commands that I ran on macOS 10.12.6 from bash alongside my findings so you can tell me if my methods are flawed.

If you are finding consistency issues though

# All text files that are in the repo.
$ git ls-files -z | xargs -0 file | awk -F: '$2 ~ /text/' > text-files.txt; wc -l < text-files.txt
2080

# Those that have CRLF line endings.
$ awk -F: '$2 ~ /CRLF/' text-files.txt | wc -l
107

# Those that are UTF-8 with a BOM - whether they contain actual non-ASCII chars. or not.
$ awk -F: '$2 ~ /UTF-8 Unicode \(with BOM\)/' text-files.txt | wc -l
400

# Those that are actually UTF-8 encoded, despite not having a BOM:
# They contain UTF-8-encoded non-ASCII characters.
# (All BOM-less files that have ASCII-only characters are reported as ASCII by `file`)
$ awk -F: '$2 ~ /UTF-8 Unicode text/' text-files.txt | wc -l
19

# Those that are Windows-1252 encoded (ISO-8859):
$ awk -F: '$2 ~ /ISO-8859/' text-files.txt | wc -l
12

# Those that contain invalid characters.
# They have no BOM and contain characters that are neither Windows-1252 nor UTF-8-encoded:
$ awk -F: '$2 ~ /Non-ISO/' text-files.txt | wc -l
4

As you can see, there are a number of inconsistencies.
Removing the | wc -l part lists the files with their paths relative to the repo root folder.

I think it makes sense to normalize the newlines to LF in the repo.

My understanding is that if that's all we do, the git default configuration on Windows will convert the LFs to CRLFs on checkout, and reconvert on committing.

Re conversion to ASCII; the non-ASCII characters in the current text files fall into the following categories:

copyright, trademark, and registered trademark symbols:
- in comments and informational text files - they're candidates for replacement with (C), (TM), (RM), correct?
  - some files, such as license_thirdparty_proprietary.txt, contain a lot of non-ASCII characters (e.g., French text) that are probably better left alone.
- in string literals assigned to properties in *.psd1 files - they're candidates for using Unicode escapes such as `u{a9}, correct?
XML files whose declaration explicitly states UTF-8 - probably better left alone.
I'm not sure where the help *.txt files come from, but many of them needlessly contain non-ASCII punctuation that could easily be replaced with ASCII equivalents - they're UTF-8 files with BOM and have CRLF line endings.

Should the files that do need to retain non-ASCII characters be normalized with respect to encoding and BOM? I assume UTF-8 is the way to go, but consistently with or consistently without BOM (my vote)?

Please let me know exactly how you want to resolve all these inconsistencies before I take another stab at it.

mklement0 · 2017-09-07T04:22:23Z

As an aside: Many *.cs files have mangled headers with only a closing XML tag in the comment at the top:

Correct:

// <copyright file="ShowCommandParameterInfo.cs" company="Microsoft">
//     Copyright © Microsoft Corporation.  All rights reserved.
// </copyright>

Incorrect (e.g., src/Microsoft.PowerShell.Commands.Utility/commands/utility/ShowCommand/ShowCommandParameterSetInfo.cs):

//-----------------------------------------------------------------------
//     Copyright © Microsoft Corporation.  All rights reserved.
// </copyright>
//-----------------------------------------------------------------------

SteveL-MSFT · 2017-09-07T16:08:04Z

@lzybkr git converts on the client the native line endings so on Windows, you end up with CRLF, so I'm not sure why you are against having just LF for the files in the repo. Conversely, I would be fine having all of them CRLF since git will convert to just LF on Linux/Mac, but it's an extra character for every line.

lzybkr · 2017-09-07T16:14:40Z

One additional reason to not change line endings - git blame will blame @mklement0 for nearly everything going forward until those lines change again - in other words, useful information in the history is obscured.

lzybkr · 2017-09-07T16:22:16Z

My original reason was that git stores the head for fast checkouts. By storing LF - git is forced to convert on checkout on Windows but not elsewhere - making checkout slower where more development is done.

If more development was done on Linux/Mac - then there might be an argument for changing the line endings if you didn't mind having basically every line of code appear to be authored by the same person.

Note that we can change the author for such a commit - we've used Author: PowerShell Team <PowerShellTeam@hotmail.com>, but we did that early on before we had a lot of history. Doing so now would make git blame a bit less useful (extra work to find real author of the code you are looking at.)

lzybkr · 2017-09-07T16:25:23Z

Oh - one more reason - git automatically converts only if you set git up correctly. Not everybody does that, and it's a pain when they don't. I speak from experience here - it wastes maintainers time helping contributors get set up correctly, fix their commits, etc.

So now that I've paged back in a bunch of reasons why I'm opposed to this PR, I'll actually strongly opposed to this PR as is.

rkeithhill · 2017-09-07T16:26:26Z

@mklement0 I see different stats for EOL types:

95:75ms> git ls-files --eol | %{$p = $_ -split '\s+'; [pscustomobject]@{Index=$p[0];WorkDir=$p[1];Attr=$p[2];Path=$p[3]
} } | group Index

Count Name                      Group
----- ----                      -----
 1972 i/lf                      {@{Index=i/lf; WorkDir=w/crlf; Attr=attr/; Path=.gitattributes}, @{Index=i/lf; WorkD...
   53 i/-text                   {@{Index=i/-text; WorkDir=w/-text; Attr=attr/; Path=.github/Images/Github-PR-dev.png...
  100 i/crlf                    {@{Index=i/crlf; WorkDir=w/crlf; Attr=attr/; Path=assets/license.rtf}, @{Index=i/crl...
    4 i/none                    {@{Index=i/none; WorkDir=w/none; Attr=attr/; Path=src/Microsoft.PowerShell.CoreCLR.A...
    2 i/                        {@{Index=i/; WorkDir=w/; Attr=attr/; Path=src/Modules/Shared/Pester}, @{Index=i/; Wo...
    5 i/mixed                   {@{Index=i/mixed; WorkDir=w/mixed; Attr=attr/; Path=src/System.Management.Automation...

96:84ms> git ls-files --eol | %{$p = $_ -split '\s+'; [pscustomobject]@{Index=$p[0];WorkDir=$p[1];Attr=$p[2];Path=$p[3]
} } | group WorkDir

Count Name                      Group
----- ----                      -----
 2072 w/crlf                    {@{Index=i/lf; WorkDir=w/crlf; Attr=attr/; Path=.gitattributes}, @{Index=i/lf; WorkD...
   53 w/-text                   {@{Index=i/-text; WorkDir=w/-text; Attr=attr/; Path=.github/Images/Github-PR-dev.png...
    4 w/none                    {@{Index=i/none; WorkDir=w/none; Attr=attr/; Path=src/Microsoft.PowerShell.CoreCLR.A...
    2 w/                        {@{Index=i/; WorkDir=w/; Attr=attr/; Path=src/Modules/Shared/Pester}, @{Index=i/; Wo...
    5 w/mixed                   {@{Index=i/mixed; WorkDir=w/mixed; Attr=attr/; Path=src/System.Management.Automation...

Also, to prevent individuals with varying autocrlf settings, perhaps the .gitattributes file should enforce whatever decision is made here:

https://help.github.com/articles/dealing-with-line-endings/#per-repository-settings

rkeithhill · 2017-09-07T16:28:41Z

Here are the 100 files stored in the repo (index) as CRLF:


Index  WorkDir Attr  Path
-----  ------- ----  ----
i/crlf w/crlf  attr/ assets/license.rtf
i/crlf w/crlf  attr/ docker/release/nanoserver-insider/Dockerfile
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.Commands.Diagnostics/CommonUtils.cs
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.PSReadLine/en-US/PSReadline.md
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.Security/resources/CertificateProviderStrings.resx
i/crlf w/crlf  attr/ src/Modules/Shared/Microsoft.PowerShell.Host/Microsoft.PowerShell.Host.psd1
i/crlf w/crlf  attr/ src/Modules/Shared/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psm1
i/crlf w/crlf  attr/ src/Modules/Unix/Microsoft.PowerShell.Management/Microsoft.PowerShell.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Unix/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/CimCmdlets/CimCmdlets.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/Microsoft.WSMan.Management/Microsoft.WSMan.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/Microsoft.WSMan.Management/WSMan.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/PSDiagnostics/PSDiagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/PSDiagnostics/PSDiagnostics.psm1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Diagnostics/Microsoft.PowerShell.Diagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Management/Microsoft.PowerShell.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Diagnostics.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Event.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/GetEvent.types.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Microsoft.PowerShell.Diagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.LocalAccounts/Microsoft.PowerShell.LocalAccounts....
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataAdapter.ps1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtils.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtils.psm1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtilsHelper.ps1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/en-US/Microsoft.PowerShell.ODataUtilsS...
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/PSScheduledJob/PSScheduledJob.Format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/PSScheduledJob/PSScheduledJob.types.ps1xml
i/crlf w/crlf  attr/ src/Schemas/PSMaml/Maml_HTML.xsl
i/crlf w/crlf  attr/ src/libpsl-native/src/createprocess.cpp
i/crlf w/crlf  attr/ src/libpsl-native/src/createprocess.h
i/crlf w/crlf  attr/ test/powershell/Language/Classes/MSFT_778492.psm1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/ProtectedAccess.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Attributes.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.BasicParsing.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Break.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Exceptions.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.MiscOps.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Modules.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.NestedModules.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.inheritance.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.using.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.enums.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/CompletionTestSupport.psm1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Ast.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/AutomaticVariables.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/BNotOperator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Conversions.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/ExtensibleCompletion.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/LanguageAndParser.TestFollowup.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/MethodInvocation.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/ParameterBinding.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Parser.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/RedirectionOperator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/TypeAccelerator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/UsingAssembly.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/UsingNamespace.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/Debugging/DebuggerScriptTests.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/Debugging/DebuggingInHost.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/HashtableToPSCustomObjectConversion.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/LineEndings.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/NativeExecution/NativeCommandArguments.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/NativeExecution/NativeStreams.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Security/GetCredential.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Security/TestData/CatalogTestData/UserConfigProv/D...
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Clear-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Eventing.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Export-Alias.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Import-Alias.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/New-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Set-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Trace-Command.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Update-FormatData.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Update-TypeData.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/assets/TestCsv2.csv
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/assets/UpdateFormatDataTests.format.ps1xml
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.WSMan.Management/CredSSP.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountSubstDrive.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountSubstDriveCore.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountVHDDrive.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/Pester.AutomountedDrives.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/ProviderIntrinsics.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/README.md
i/crlf w/crlf  attr/ test/powershell/engine/Api/Serialization.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Basic/Attributes.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Basic/DefaultCommands.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/Cdxml.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CdxmlTest.psd1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CimTest.cdxml
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CreateCimTest.mof
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/DeleteCimTest.mof
i/crlf w/crlf  attr/ test/powershell/engine/ETS/Adapter.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/ParameterBinding/BooleanParameterDCR.Tests.ps1
i/crlf w/crlf  attr/ test/tools/OpenCover/OpenCover.Types.ps1xml
i/crlf w/crlf  attr/ test/tools/OpenCover/OpenCover.psd1
i/crlf w/crlf  attr/ test/tools/TestExe/TestExe.cs
i/crlf w/crlf  attr/ test/tools/TestExe/TestExe.csproj

iSazonov · 2017-09-07T17:49:35Z

I suggest split the super large PR. May start with fixing copyright, trademark, and registered trademark symbols in comments and md files?

mklement0 · 2017-09-07T21:33:02Z

@rkeithhill: Great analysis (and nice combination of git and PowerShell) - good to know about git ls-files --eol: it is certainly the better tool to use for this analysis, though, as it turns out, my approach unwittingly unearthed edge cases:

The reasons for the CRLF-file-count discrepancy are:

file doesn't mention newlines for assets/license.rtf, just Rich Text Format data, version 1, unknown character set, so my CRLF substring matching didn't match it. [update: actually, the primary reason is that case-sensitive regex /text/ didn't match the Text in the RTF description, but even with that corrected, the argument still applies].
my CRLF substring matching included the 5 mixed newline-style files.
my CRLF substring matching matched 3 files that git considers binary (-text), because they contain CR-only newlines too (as an aside: VSCode apparently quietly discards such isolated CR instances on opening):

src/Schemas/PSMaml/Maml.tbr
src/System.Management.Automation/resources/CmdletizationCoreResources.resx
test/powershell/Modules/Microsoft.PowerShell.Security/TestData/CatalogTestData/CatalogTestFile1.mof

mklement0 · 2017-09-08T01:37:29Z

@lzybkr:

git's native newline style is LF.
git considers a file normalized with respect to newlines if it has LF newlines.
git's newline conversion features focus on locally, temporarily changing newline styles for the workspace, if needed, without affecting normalized storage in the repository.

With git's default configuration (core.autocrlf unset on Unix, set to true on Windows):

you have to go out of your way to commit a text file that is stored with CRLF newlines in the repository (such as explicitly creating a CRLF file on Unix and adding it to the repo there).
and if you do so, your workspaces on Unix will contain CRLF files (because no conversion is applied on checkout), which is problematic.

In other words: What you're strongly opposing is:

git's native behavior.
the state of roughly 95% of the text files currently in the repository (they're stored with LF newlines; that also means we're mostly already paying the performance penalty you mention on Windows).

@rkeithhill's suggestion is the way to go for enforcing platform-appropriate check-out and check-in behavior as part of the repo rather than relying on contributors' global git configuration to be correct (the problem you mentioned).

Specifically, this means adding the following line to the extant .gitattributes repo file:

* text=auto

To quote from man git-attrib:

If you want to ensure that text files that any contributor introduces to the repository have their line endings normalized, you can set the text attribute to "auto" for all files.

So, with respect to newlines, I think the way to go is:

Convert all CRLF-only, mixed-CRLF-LF/CR to LF-only files in the repo as a one-time action.
- If there are truly files that need CRLF line endings even on Unix, add them as exceptions to .gitattributes.
- If you "unblame" me, should my PR be accepted, please do so - we're talking about 103 files currently (the 100 @rkeithhill mentioned, plus the 3 I mentioned later (the ones with isolated CR instances)).
Add * text=auto to .gitattributes to ensure that appropriate to-and-from-the-workspace conversion happens on Windows, without relying on contributors' git installations to be configured correctly.

@iSazonov's suggestion to split this - definitely unusable in it current form - PR into multiple PRs makes sense, and I'm happy to tackle that, but I need consensus and clear guidance, not just on the newline issue, but also on the specific Unicode-character/encoding-normalization questions raised above.

iSazonov · 2017-09-08T04:38:16Z

If we have right git config in the repo users will get it after fork in local workspace - so I vote for LF in the repo.

We can fix some things out of the PR and we should do so. I ask to confirm that we should replace copyrights (and so on) with ASCII chars if we want ASCII files.

lzybkr · 2017-09-08T05:39:04Z

Primarily I object to the need to review hundreds of files. I can't in good conscience merge a change like this because I'm forced to review every single line or risk taking an inadvertent or possibly malicious change.

At a minimum, I would ask that you provide the scripts that made the changes and a PowerShell team member make the actual changes - that would address my concerns about needing to review every line.

It's much easier to review the BOM change - and I see the value in that as well, though we disagree on the approach to removing Unicode characters. I can repeat my objections - people do use tools that mess up those characters, and by sticking with ASCII (and enforcing it with tools), I can let my guard down some.

iSazonov · 2017-09-08T06:11:58Z

We could add the checks in Meta test and git hooks.

mklement0 · 2017-09-08T17:10:41Z

@lzybkr:

Re newlines: Understood; I'll list the shell commands I used in a separate comment.

Re encoding / Unicode characters:

Even if we go with the ASCII approach, we need a way to enforce the desired encoding going forward either way, so as to avoid sliding back into a confusion of encodings and newline styles over time.

As @iSazonov mentions, server-side Git hooks and possibly checks during CI testing can give us that.

With the ability to enforce the desired encoding / newlines styles, we may as well enforce BOM-less UTF-8, which radically simplifies things going forward, just like it will for end-users, once PS Core speaks BOM-less UTF-8 natively.

As an aside:

It's also worth considering supplying a custom .gitconfig file and a (client-side) pre-commit hook script as part of the repo to help with enforcing whitespace and encoding rules client-side.

While there is no fully automatic way to install them, a one-time setup script that must be run every time a local repo is set up would suffice - and, if feasible, a server-side check failing could recommend running that setup script in its error message.

mklement0 · 2017-09-08T23:24:07Z

`bash` shell commands for normalizing all line endings to LF

You should be able to run this on either macOS or Linux (I ran it on macOS 10.12.6), from the PS repo folder.
The only prerequisite is utility dos2unix; install it as follows:
- on Ubuntu, for instance: sudo apt-get install dos2unix
- on macOS, with Homebrew installed: brew install dos2unix

Note that the commands do 3 things:

all CRLF instances are converted to LF
remaining, stray CR-only instances are removed
a trailing LF is added to any single-line text files without a newline

The commands create aux. file test-files.txt in the repo folder - remove it once conversion is done.

# Save a list of all *text* files in 'text-files.txt', for use below.
git ls-files -z | xargs -0 file | awk -F: '$2 ~ /[Tt]ext/' > text-files.txt

#  The following converts all CR-containing files to LF-only newlines.
awk -F: '$2 ~ / CR/ { print $1 }' text-files.txt | xargs -I{} dos2unix {}

# Now remove any remaining *isolated CR* instances from those files that had them.:
awk -F: '/ CR[ ,]/ {print $1}' text-files.txt | xargs -I{} sh -c 'tr -d "\r" <"$1" >"$1.$$" && mv "$1.$$" "$1"' - {}

# Those text files that have NO newlines at all (single-line files with no trailing newline):
# Append LF to them
awk -F: '$2 ~ / no line terminators/ { print $1 }' text-files.txt | xargs -I{} sh -c 'printf "\n" >> "$1"' - {}

mklement0 · 2017-09-10T15:43:16Z

@SteveL-MSFT

After looking into this some more:

Conversely, I would be fine having all of them CRLF since git will convert to just LF on Linux/Mac

That's actually not how it works:
Since git is LF-centric, CRLF files stored as such in the repo are considered NON-text files, and as such the automatic newline-conversion configuration settings do not apply.

As non-text ("binary") files, they are checked-out and committed as-is, both on Windows and Unix, which means that you'd get CRLF newlines in your working tree on Unix (which is clearly the wrong thing to do).

As stated, you must go out of your way - either intentionally or accidentally - to even succeed at committing a CRLF file with CRLFs preserved (assign the -text (non-text) attribute explicitly to a text file in .gitattributes, or, in the absence of attributes, commit a CRLF file while core.autocrlf is not set to true).

Conversely, anything that git considers a text file invariably has its newlines normalized to LF on staging (and therefore on committing) - even if a working-tree file's newline style doesn't match the expected one with respect to git's settings.

SteveL-MSFT · 2017-09-10T15:57:12Z

@mklement0 thanks for the insights. I'll take care of this based on the commands you provided and will submit as new PR.

mklement0 · 2017-09-11T04:17:17Z

Thanks, @SteveL-MSFT.

Below are commands for assisting with normalizing the encoding of all files to BOM-less UTF-8.

Even if you decide to go with ASCII, after all (boo, hiss), you can use the normalized files as a consistent starting point for the to-ASCII conversion; utility print-nonascii can help with locating the lines that require changes (see below).

Note the source-code comments re caveats.

I'm closing this PR, and I'm leaving this to you and your team to complete.

`bash` shell commands for normalizing all encodings to BOM-less UTF-8

You should be able to run this on either macOS or Linux (I ran it on macOS 10.12.6), from the PS repo folder.
The only prerequisite is utility print-nonascii; install it as follows, via npm (comes with Node.js):
- npm i -g print-nonascii
- For manual installation, see the repo.
The commands rely on aux. file test-files.txt, as created by the newline-normalization commands above.

# CAVEAT: RTF, XML, HTML and other files that specify the encoding *as part of the file* should be EXCLUDED below,
#         because re-encoding could leave such files in an inconsistent state.

# Convert all with-BOM UTF-8 files to BOM-less UTF-8
awk -F: '$2 ~ /UTF-8 Unicode \(with BOM\)/ { print $1 }' text-files.txt | xargs -I{} bash -c 'tail -c +4 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

# Convert all Windows-1252 / ISO-8859 files to BOM-less UTF-8
awk -F: '$2 ~ /ISO-8859/ { print $1 }' text-files.txt | xargs -I{} bash -c 'iconv -f CP1252 -t UTF-8 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

# There are Windows-1252 characters that are missing from ISO-8859-1 such as "curly quotes".
# `file` only knows about ISO-8859-1, so it reports files with such characters as "Non-ISO".
# We *assume* that such files are Windows-1252 files, and convert them to BOM-less UTF-8 based on that assumption.
# NOTE: If the assumption doesn't hold, characters may be *lost*, so this command may have to be
#       undone.
awk -F: '$2 ~ /Non-ISO/ { print $1 }' text-files.txt | xargs -I{} bash -c 'iconv -f CP1252 -t UTF-8 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

After the conversion, use the following print-nonascii command to print lines with non-ASCII characters from all text files and visually verify that the re-encoding was successful in all cases.

Each line with one or more non-ASCII character is printed twice, preceded by the line number:

once as-is
again with PowerShell-style Unicode escapes; e.g., `u{2013} for – (EN DASH)

Note that there are non-obvious non-ASCII characters such as punctuation that outwardly resembles ASCII punctuation, such as the EN DASH example;
similarly, there are instances of non-ASCII quotes.

# Print lines that contain non-ASCII chars. across all text files.
awk -F: '{ print $1 }' text-files.txt | tr '\n' '\0' | xargs -0 print-nonascii -n --psh -r

The above yields, for instance (only a few lines shown):

###	CHANGELOG.md
100:* Fix an issue in implicit remoting where restricted sessions couldn't use `Get-FormatData –PowerShellVersion`. (#4222)
100:* Fix an issue in implicit remoting where restricted sessions couldn't use `Get-FormatData `u{2013}PowerShellVersion`. (#4222)
###	ThirdPartyNotices.txt
5:The software is based on or incorporates material from the projects listed below (collectively, “Third Party Code”).  Microsoft is not the original author of the Third Party Code.  The original copyright notice and license, under which Microsoft received such Third Party Code, are set forth below. Microsoft reserves all rights not expressly granted herein, whether by implication, estoppel or otherwise.
5:The software is based on or incorporates material from the projects listed below (collectively, `u{201c}Third Party Code`u{201d}).  Microsoft is not the original author of the Third Party Code.  The original copyright notice and license, under which Microsoft received such Third Party Code, are set forth below. Microsoft reserves all rights not expressly granted herein, whether by implication, estoppel or otherwise.
###	assets/AppImageThirdPartyNotices.txt
8:Copyright © 1991-2016 Unicode, Inc. All rights reserved.
8:Copyright `u{a9} 1991-2016 Unicode, Inc. All rights reserved.
...

SteveL-MSFT · 2017-09-20T04:38:06Z

I'll try to take care of this on the weekend so there isn't so many PRs causing merge conflicts.

rkeithhill · 2017-09-24T20:46:01Z

@SteveL-MSFT What do you think about the suggested change to .gitattributes to enforce the desired line-endings for this repo?

mklement0 · 2017-09-24T22:50:58Z

Additionally, let me reiterate the need for (at least) a server-side hook that rejects improperly encoded text files (except deliberately differently encoded test files, if applicable).

SteveL-MSFT · 2017-09-24T22:53:33Z

I'm prepping a PR to at least change the .gitattributes to enforce it so we'll slowly get there. I'll bulk fix the rest after 6.0.0. Also doing some other fixes in this PR I'm submitting soon (like copyright symbol)

mklement0 added 4 commits September 6, 2017 00:12

CRLF converted to LF, trailing LF ensured.

ad0c5d3

Fixed/removed invalid characters.

2236bcc

Windows-1252-encoded files converted to BOM-less UTF-8 files.

e731069

UTF-8 BOM removed from all files that had one.

3bfbf78

mklement0 requested review from BrucePay, Francisco-Gamino, HemantMahawar, JamesWTruher, PaulHigin, SteveL-MSFT, TravisEz13, adityapatwardhan, anmenaga, charub, dantraMSFT, daxian-dbw, joeyaiello, lzybkr, mirichmo and vors as code owners September 6, 2017 18:45

msftclas added the cla-already-signed label Sep 6, 2017

mklement0 changed the title ~~Implements #4752, and more: line endings standardized, converted all file to BOM-less UTF-8~~ Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 Sep 6, 2017

Implements PowerShell#4752 and more [feature]: line endings standardi…

588a3b8

…zed, converted all file to BOM-less UTF-8

SteveL-MSFT requested changes Sep 7, 2017

View reviewed changes

mklement0 closed this Sep 11, 2017

mklement0 deleted the source-encoding-newline-fix branch September 11, 2017 04:19

SteveL-MSFT mentioned this pull request Sep 14, 2017

Updated default ModuleVersion in ModuleManifest #4842

Merged

SteveL-MSFT mentioned this pull request Sep 24, 2017

Convert all files to ASCII encoding #4910

Closed

SteveL-MSFT mentioned this pull request Sep 24, 2017

Enable auto EOL on git repo side, fix some character encoding issues #4912

Merged

mklement0 mentioned this pull request Oct 3, 2019

Provide simple one-time setting of environment variable #3316

Open

Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 #4761

Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 #4761

Uh oh!

Conversation

mklement0 commented Sep 6, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

msftclas commented Sep 6, 2017

Uh oh!

SteveL-MSFT commented Sep 6, 2017

Uh oh!

rkeithhill commented Sep 6, 2017

Uh oh!

SteveL-MSFT commented Sep 7, 2017

Uh oh!

mklement0 commented Sep 7, 2017

Uh oh!

SteveL-MSFT commented Sep 7, 2017

Uh oh!

SteveL-MSFT left a comment

Choose a reason for hiding this comment

Uh oh!

lzybkr commented Sep 7, 2017

Uh oh!

mklement0 commented Sep 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mklement0 commented Sep 7, 2017

Uh oh!

SteveL-MSFT commented Sep 7, 2017

Uh oh!

lzybkr commented Sep 7, 2017

Uh oh!

lzybkr commented Sep 7, 2017

Uh oh!

lzybkr commented Sep 7, 2017

Uh oh!

rkeithhill commented Sep 7, 2017

Uh oh!

rkeithhill commented Sep 7, 2017

Uh oh!

iSazonov commented Sep 7, 2017

Uh oh!

mklement0 commented Sep 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mklement0 commented Sep 8, 2017

Uh oh!

iSazonov commented Sep 8, 2017

Uh oh!

lzybkr commented Sep 8, 2017

Uh oh!

iSazonov commented Sep 8, 2017

Uh oh!

mklement0 commented Sep 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mklement0 commented Sep 8, 2017

bash shell commands for normalizing all line endings to LF

Uh oh!

mklement0 commented Sep 10, 2017

Uh oh!

SteveL-MSFT commented Sep 10, 2017

Uh oh!

mklement0 commented Sep 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

bash shell commands for normalizing all encodings to BOM-less UTF-8

Uh oh!

SteveL-MSFT commented Sep 20, 2017

Uh oh!

rkeithhill commented Sep 24, 2017

Uh oh!

mklement0 commented Sep 24, 2017

Uh oh!

SteveL-MSFT commented Sep 24, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

mklement0 commented Sep 6, 2017 •

edited

Loading

mklement0 commented Sep 7, 2017 •

edited

Loading

mklement0 commented Sep 7, 2017 •

edited

Loading

mklement0 commented Sep 8, 2017 •

edited

Loading

`bash` shell commands for normalizing all line endings to LF

mklement0 commented Sep 11, 2017 •

edited

Loading

`bash` shell commands for normalizing all encodings to BOM-less UTF-8