Skip to content

Conversation

@mklement0
Copy link
Contributor

@mklement0 mklement0 commented Sep 6, 2017

Implements #4752, though not exactly as asked:
This PR does not replace the copyright/trademark symbols with their equivalent Unicode escape sequences, but instead - in line with BOM-less UTF-8 becoming the default - performs the following tasks:

  • all CRLF newlines were converted to LF-only newlines

  • all (by definition BOM-less) Windows-1252-encoded files were converted to BOM-less UTF-8.

  • all UTF-8 files with BOM had their BOM removed.

  • a few files contained invalid characters; they were replaced/removed.

In other words: all files are now BOM-less UTF-8 files with LF-only line endings, and, ideally, only such files should be added in the future.

@msftclas
Copy link

msftclas commented Sep 6, 2017

@mklement0,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

@SteveL-MSFT
Copy link
Member

Please fix the test failures. Also, submit with [feature] as first line in commit description so the feature tests get run.

@mklement0 mklement0 changed the title Implements #4752, and more: line endings standardized, converted all file to BOM-less UTF-8 Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 Sep 6, 2017
…zed, converted all file to BOM-less UTF-8
@rkeithhill
Copy link
Collaborator

I get the re-encoding work but why change the line-endings to LF? Why not let autocrlf continue to do its magic? If you do go through with this change, then you should update .gitattributes to enforce the EOL seq e.g * text eol=lf.

@SteveL-MSFT
Copy link
Member

Based on discussion with @lzybkr we both agree that our files should just be ascii. Any characters outside of that range should be escaped appropriately. Also, agree with @rkeithhill that line endings shouldn't change as git will convert as necessary.

@mklement0
Copy link
Contributor Author

@SteveL-MSFT:

Re ASCII: I can do that, however:

  • I suggest you then somehow implement checks that prevent checking in files with non-ASCII characters.

  • Currently, the only instances of non-ASCII characters are in comments and informational text-only files - wouldn't it make more sense to show the actual copyright symbol than a u{...} escape sequence, for instance?

Again, going with the already agreed-upon BOM-less UTF-8 as the implied default encoding seems like the better choice to me.
Similarly, we then need a check that prevents checking in invalid-as-UTF-8 files (for ASCII-characters-only files, this will never be a concern).


Re newlines:

Of the 3053 text files currently in the repo, only 366 are currently CRLF; the remaining 2687 already are LF-only files.

What benefit is there in allowing this mix of newline styles?

As long as we make the tests aware of platform-typical checkout behavior, we can continue to let git's core.autocrlf do its to-and-fro CRLF-LF conversion, although I like @rkeithhill's idea of enforcing LF on checkout as part of the repo, which would simplify tests.

@rkeithhill: Is adding that one line to .gitattributes all that is needed?

@SteveL-MSFT
Copy link
Member

@mklement0 for comments and markdown files, I think using (C) or (R) is sufficient, the escape sequence was specifically about code

I have noticed that sometimes after I push a change, GitHub doesn't display the diff correctly because of line endings (I usually end up resaving it as LF which seems to fix the problem so I suspect this is the issue). So if converting the rest to have consistency and potentially resolves this problem, I'm ok with having everything LF on the repo side.

Copy link
Member

@SteveL-MSFT SteveL-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File encoding should be ASCII. Ok with line endings being LF.

@lzybkr
Copy link
Contributor

lzybkr commented Sep 7, 2017

I'm not OK with LF - we should use "native" line endings where most of the developers are. Today and for the foreseeable future, that skews towards Windows.

If you are finding consistency issues though, I'm happy to see that fixed, just not moving to LF.

@mklement0
Copy link
Contributor Author

mklement0 commented Sep 7, 2017

I'll list the commands that I ran on macOS 10.12.6 from bash alongside my findings so you can tell me if my methods are flawed.

If you are finding consistency issues though

# All text files that are in the repo.
$ git ls-files -z | xargs -0 file | awk -F: '$2 ~ /text/' > text-files.txt; wc -l < text-files.txt
2080

# Those that have CRLF line endings.
$ awk -F: '$2 ~ /CRLF/' text-files.txt | wc -l
107

# Those that are UTF-8 with a BOM - whether they contain actual non-ASCII chars. or not.
$ awk -F: '$2 ~ /UTF-8 Unicode \(with BOM\)/' text-files.txt | wc -l
400

# Those that are actually UTF-8 encoded, despite not having a BOM:
# They contain UTF-8-encoded non-ASCII characters.
# (All BOM-less files that have ASCII-only characters are reported as ASCII by `file`)
$ awk -F: '$2 ~ /UTF-8 Unicode text/' text-files.txt | wc -l
19

# Those that are Windows-1252 encoded (ISO-8859):
$ awk -F: '$2 ~ /ISO-8859/' text-files.txt | wc -l
12

# Those that contain invalid characters.
# They have no BOM and contain characters that are neither Windows-1252 nor UTF-8-encoded:
$ awk -F: '$2 ~ /Non-ISO/' text-files.txt | wc -l
4

As you can see, there are a number of inconsistencies.
Removing the | wc -l part lists the files with their paths relative to the repo root folder.

I think it makes sense to normalize the newlines to LF in the repo.

My understanding is that if that's all we do, the git default configuration on Windows will convert the LFs to CRLFs on checkout, and reconvert on committing.

Re conversion to ASCII; the non-ASCII characters in the current text files fall into the following categories:

  • copyright, trademark, and registered trademark symbols:

    • in comments and informational text files - they're candidates for replacement with (C), (TM), (RM), correct?

      • some files, such as license_thirdparty_proprietary.txt, contain a lot of non-ASCII characters (e.g., French text) that are probably better left alone.
    • in string literals assigned to properties in *.psd1 files - they're candidates for using Unicode escapes such as `u{a9}, correct?

  • XML files whose declaration explicitly states UTF-8 - probably better left alone.

  • I'm not sure where the help *.txt files come from, but many of them needlessly contain non-ASCII punctuation that could easily be replaced with ASCII equivalents - they're UTF-8 files with BOM and have CRLF line endings.

Should the files that do need to retain non-ASCII characters be normalized with respect to encoding and BOM? I assume UTF-8 is the way to go, but consistently with or consistently without BOM (my vote)?

Please let me know exactly how you want to resolve all these inconsistencies before I take another stab at it.

@mklement0
Copy link
Contributor Author

As an aside: Many *.cs files have mangled headers with only a closing XML tag in the comment at the top:

Correct:

// <copyright file="ShowCommandParameterInfo.cs" company="Microsoft">
//     Copyright © Microsoft Corporation.  All rights reserved.
// </copyright>

Incorrect (e.g., src/Microsoft.PowerShell.Commands.Utility/commands/utility/ShowCommand/ShowCommandParameterSetInfo.cs):

//-----------------------------------------------------------------------
//     Copyright © Microsoft Corporation.  All rights reserved.
// </copyright>
//-----------------------------------------------------------------------

@SteveL-MSFT
Copy link
Member

@lzybkr git converts on the client the native line endings so on Windows, you end up with CRLF, so I'm not sure why you are against having just LF for the files in the repo. Conversely, I would be fine having all of them CRLF since git will convert to just LF on Linux/Mac, but it's an extra character for every line.

@lzybkr
Copy link
Contributor

lzybkr commented Sep 7, 2017

One additional reason to not change line endings - git blame will blame @mklement0 for nearly everything going forward until those lines change again - in other words, useful information in the history is obscured.

@lzybkr
Copy link
Contributor

lzybkr commented Sep 7, 2017

My original reason was that git stores the head for fast checkouts. By storing LF - git is forced to convert on checkout on Windows but not elsewhere - making checkout slower where more development is done.

If more development was done on Linux/Mac - then there might be an argument for changing the line endings if you didn't mind having basically every line of code appear to be authored by the same person.

Note that we can change the author for such a commit - we've used Author: PowerShell Team <PowerShellTeam@hotmail.com>, but we did that early on before we had a lot of history. Doing so now would make git blame a bit less useful (extra work to find real author of the code you are looking at.)

@lzybkr
Copy link
Contributor

lzybkr commented Sep 7, 2017

Oh - one more reason - git automatically converts only if you set git up correctly. Not everybody does that, and it's a pain when they don't. I speak from experience here - it wastes maintainers time helping contributors get set up correctly, fix their commits, etc.

So now that I've paged back in a bunch of reasons why I'm opposed to this PR, I'll actually strongly opposed to this PR as is.

@rkeithhill
Copy link
Collaborator

@mklement0 I see different stats for EOL types:

95:75ms> git ls-files --eol | %{$p = $_ -split '\s+'; [pscustomobject]@{Index=$p[0];WorkDir=$p[1];Attr=$p[2];Path=$p[3]
} } | group Index

Count Name                      Group
----- ----                      -----
 1972 i/lf                      {@{Index=i/lf; WorkDir=w/crlf; Attr=attr/; Path=.gitattributes}, @{Index=i/lf; WorkD...
   53 i/-text                   {@{Index=i/-text; WorkDir=w/-text; Attr=attr/; Path=.github/Images/Github-PR-dev.png...
  100 i/crlf                    {@{Index=i/crlf; WorkDir=w/crlf; Attr=attr/; Path=assets/license.rtf}, @{Index=i/crl...
    4 i/none                    {@{Index=i/none; WorkDir=w/none; Attr=attr/; Path=src/Microsoft.PowerShell.CoreCLR.A...
    2 i/                        {@{Index=i/; WorkDir=w/; Attr=attr/; Path=src/Modules/Shared/Pester}, @{Index=i/; Wo...
    5 i/mixed                   {@{Index=i/mixed; WorkDir=w/mixed; Attr=attr/; Path=src/System.Management.Automation...

96:84ms> git ls-files --eol | %{$p = $_ -split '\s+'; [pscustomobject]@{Index=$p[0];WorkDir=$p[1];Attr=$p[2];Path=$p[3]
} } | group WorkDir

Count Name                      Group
----- ----                      -----
 2072 w/crlf                    {@{Index=i/lf; WorkDir=w/crlf; Attr=attr/; Path=.gitattributes}, @{Index=i/lf; WorkD...
   53 w/-text                   {@{Index=i/-text; WorkDir=w/-text; Attr=attr/; Path=.github/Images/Github-PR-dev.png...
    4 w/none                    {@{Index=i/none; WorkDir=w/none; Attr=attr/; Path=src/Microsoft.PowerShell.CoreCLR.A...
    2 w/                        {@{Index=i/; WorkDir=w/; Attr=attr/; Path=src/Modules/Shared/Pester}, @{Index=i/; Wo...
    5 w/mixed                   {@{Index=i/mixed; WorkDir=w/mixed; Attr=attr/; Path=src/System.Management.Automation...

Also, to prevent individuals with varying autocrlf settings, perhaps the .gitattributes file should enforce whatever decision is made here:

https://help.github.com/articles/dealing-with-line-endings/#per-repository-settings

@rkeithhill
Copy link
Collaborator

Here are the 100 files stored in the repo (index) as CRLF:


Index  WorkDir Attr  Path
-----  ------- ----  ----
i/crlf w/crlf  attr/ assets/license.rtf
i/crlf w/crlf  attr/ docker/release/nanoserver-insider/Dockerfile
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.Commands.Diagnostics/CommonUtils.cs
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.PSReadLine/en-US/PSReadline.md
i/crlf w/crlf  attr/ src/Microsoft.PowerShell.Security/resources/CertificateProviderStrings.resx
i/crlf w/crlf  attr/ src/Modules/Shared/Microsoft.PowerShell.Host/Microsoft.PowerShell.Host.psd1
i/crlf w/crlf  attr/ src/Modules/Shared/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psm1
i/crlf w/crlf  attr/ src/Modules/Unix/Microsoft.PowerShell.Management/Microsoft.PowerShell.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Unix/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/CimCmdlets/CimCmdlets.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/Microsoft.WSMan.Management/Microsoft.WSMan.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/Microsoft.WSMan.Management/WSMan.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/PSDiagnostics/PSDiagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core+Full/PSDiagnostics/PSDiagnostics.psm1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Diagnostics/Microsoft.PowerShell.Diagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Management/Microsoft.PowerShell.Management.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Core/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Diagnostics.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Event.format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/GetEvent.types.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Diagnostics/Microsoft.PowerShell.Diagnostics.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.LocalAccounts/Microsoft.PowerShell.LocalAccounts....
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataAdapter.ps1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtils.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtils.psm1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/Microsoft.PowerShell.ODataUtilsHelper.ps1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.ODataUtils/en-US/Microsoft.PowerShell.ODataUtilsS...
i/crlf w/crlf  attr/ src/Modules/Windows-Full/Microsoft.PowerShell.Utility/Microsoft.PowerShell.Utility.psd1
i/crlf w/crlf  attr/ src/Modules/Windows-Full/PSScheduledJob/PSScheduledJob.Format.ps1xml
i/crlf w/crlf  attr/ src/Modules/Windows-Full/PSScheduledJob/PSScheduledJob.types.ps1xml
i/crlf w/crlf  attr/ src/Schemas/PSMaml/Maml_HTML.xsl
i/crlf w/crlf  attr/ src/libpsl-native/src/createprocess.cpp
i/crlf w/crlf  attr/ src/libpsl-native/src/createprocess.h
i/crlf w/crlf  attr/ test/powershell/Language/Classes/MSFT_778492.psm1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/ProtectedAccess.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Attributes.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.BasicParsing.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Break.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Exceptions.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.MiscOps.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/Scripting.Classes.Modules.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.NestedModules.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.inheritance.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.Classes.using.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Classes/scripting.enums.tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/CompletionTestSupport.psm1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Ast.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/AutomaticVariables.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/BNotOperator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Conversions.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/ExtensibleCompletion.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/LanguageAndParser.TestFollowup.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/MethodInvocation.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/ParameterBinding.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/Parser.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/RedirectionOperator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/TypeAccelerator.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/UsingAssembly.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Parser/UsingNamespace.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/Debugging/DebuggerScriptTests.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/Debugging/DebuggingInHost.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/HashtableToPSCustomObjectConversion.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/LineEndings.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/NativeExecution/NativeCommandArguments.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Language/Scripting/NativeExecution/NativeStreams.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Security/GetCredential.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Security/TestData/CatalogTestData/UserConfigProv/D...
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Clear-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Eventing.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Export-Alias.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Import-Alias.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/New-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Set-Variable.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Trace-Command.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Update-FormatData.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/Update-TypeData.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/assets/TestCsv2.csv
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.PowerShell.Utility/assets/UpdateFormatDataTests.format.ps1xml
i/crlf w/crlf  attr/ test/powershell/Modules/Microsoft.WSMan.Management/CredSSP.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountSubstDrive.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountSubstDriveCore.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/AutomountVHDDrive.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/Pester.AutomountedDrives.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/Provider/ProviderIntrinsics.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/README.md
i/crlf w/crlf  attr/ test/powershell/engine/Api/Serialization.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Basic/Attributes.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Basic/DefaultCommands.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/Cdxml.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CdxmlTest.psd1
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CimTest.cdxml
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/CreateCimTest.mof
i/crlf w/crlf  attr/ test/powershell/engine/Cdxml/assets/CimTest/DeleteCimTest.mof
i/crlf w/crlf  attr/ test/powershell/engine/ETS/Adapter.Tests.ps1
i/crlf w/crlf  attr/ test/powershell/engine/ParameterBinding/BooleanParameterDCR.Tests.ps1
i/crlf w/crlf  attr/ test/tools/OpenCover/OpenCover.Types.ps1xml
i/crlf w/crlf  attr/ test/tools/OpenCover/OpenCover.psd1
i/crlf w/crlf  attr/ test/tools/TestExe/TestExe.cs
i/crlf w/crlf  attr/ test/tools/TestExe/TestExe.csproj

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 7, 2017

I suggest split the super large PR. May start with fixing copyright, trademark, and registered trademark symbols in comments and md files?

@mklement0
Copy link
Contributor Author

mklement0 commented Sep 7, 2017

@rkeithhill: Great analysis (and nice combination of git and PowerShell) - good to know about git ls-files --eol: it is certainly the better tool to use for this analysis, though, as it turns out, my approach unwittingly unearthed edge cases:

The reasons for the CRLF-file-count discrepancy are:

  • file doesn't mention newlines for assets/license.rtf, just Rich Text Format data, version 1, unknown character set, so my CRLF substring matching didn't match it. [update: actually, the primary reason is that case-sensitive regex /text/ didn't match the Text in the RTF description, but even with that corrected, the argument still applies].

  • my CRLF substring matching included the 5 mixed newline-style files.

  • my CRLF substring matching matched 3 files that git considers binary (-text), because they contain CR-only newlines too (as an aside: VSCode apparently quietly discards such isolated CR instances on opening):

src/Schemas/PSMaml/Maml.tbr
src/System.Management.Automation/resources/CmdletizationCoreResources.resx
test/powershell/Modules/Microsoft.PowerShell.Security/TestData/CatalogTestData/CatalogTestFile1.mof

@mklement0
Copy link
Contributor Author

@lzybkr:

  • git's native newline style is LF.
  • git considers a file normalized with respect to newlines if it has LF newlines.
  • git's newline conversion features focus on locally, temporarily changing newline styles for the workspace, if needed, without affecting normalized storage in the repository.

With git's default configuration (core.autocrlf unset on Unix, set to true on Windows):

  • you have to go out of your way to commit a text file that is stored with CRLF newlines in the repository (such as explicitly creating a CRLF file on Unix and adding it to the repo there).
  • and if you do so, your workspaces on Unix will contain CRLF files (because no conversion is applied on checkout), which is problematic.

In other words: What you're strongly opposing is:

  • git's native behavior.

  • the state of roughly 95% of the text files currently in the repository (they're stored with LF newlines; that also means we're mostly already paying the performance penalty you mention on Windows).


@rkeithhill's suggestion is the way to go for enforcing platform-appropriate check-out and check-in behavior as part of the repo rather than relying on contributors' global git configuration to be correct (the problem you mentioned).

Specifically, this means adding the following line to the extant .gitattributes repo file:

* text=auto

To quote from man git-attrib:

If you want to ensure that text files that any contributor introduces to the repository have their line endings normalized, you can set the text attribute to "auto" for all files.


So, with respect to newlines, I think the way to go is:

  • Convert all CRLF-only, mixed-CRLF-LF/CR to LF-only files in the repo as a one-time action.

    • If there are truly files that need CRLF line endings even on Unix, add them as exceptions to .gitattributes.

    • If you "unblame" me, should my PR be accepted, please do so - we're talking about 103 files currently (the 100 @rkeithhill mentioned, plus the 3 I mentioned later (the ones with isolated CR instances)).

  • Add * text=auto to .gitattributes to ensure that appropriate to-and-from-the-workspace conversion happens on Windows, without relying on contributors' git installations to be configured correctly.


@iSazonov's suggestion to split this - definitely unusable in it current form - PR into multiple PRs makes sense, and I'm happy to tackle that, but I need consensus and clear guidance, not just on the newline issue, but also on the specific Unicode-character/encoding-normalization questions raised above.

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 8, 2017

If we have right git config in the repo users will get it after fork in local workspace - so I vote for LF in the repo.

We can fix some things out of the PR and we should do so. I ask to confirm that we should replace copyrights (and so on) with ASCII chars if we want ASCII files.

@lzybkr
Copy link
Contributor

lzybkr commented Sep 8, 2017

Primarily I object to the need to review hundreds of files. I can't in good conscience merge a change like this because I'm forced to review every single line or risk taking an inadvertent or possibly malicious change.

At a minimum, I would ask that you provide the scripts that made the changes and a PowerShell team member make the actual changes - that would address my concerns about needing to review every line.

It's much easier to review the BOM change - and I see the value in that as well, though we disagree on the approach to removing Unicode characters. I can repeat my objections - people do use tools that mess up those characters, and by sticking with ASCII (and enforcing it with tools), I can let my guard down some.

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 8, 2017

We could add the checks in Meta test and git hooks.

@mklement0
Copy link
Contributor Author

mklement0 commented Sep 8, 2017

@lzybkr:

Re newlines: Understood; I'll list the shell commands I used in a separate comment.

Re encoding / Unicode characters:

Even if we go with the ASCII approach, we need a way to enforce the desired encoding going forward either way, so as to avoid sliding back into a confusion of encodings and newline styles over time.

As @iSazonov mentions, server-side Git hooks and possibly checks during CI testing can give us that.

With the ability to enforce the desired encoding / newlines styles, we may as well enforce BOM-less UTF-8, which radically simplifies things going forward, just like it will for end-users, once PS Core speaks BOM-less UTF-8 natively.


As an aside:

It's also worth considering supplying a custom .gitconfig file and a (client-side) pre-commit hook script as part of the repo to help with enforcing whitespace and encoding rules client-side.

While there is no fully automatic way to install them, a one-time setup script that must be run every time a local repo is set up would suffice - and, if feasible, a server-side check failing could recommend running that setup script in its error message.

@mklement0
Copy link
Contributor Author

bash shell commands for normalizing all line endings to LF

  • You should be able to run this on either macOS or Linux (I ran it on macOS 10.12.6), from the PS repo folder.

  • The only prerequisite is utility dos2unix; install it as follows:

    • on Ubuntu, for instance: sudo apt-get install dos2unix
    • on macOS, with Homebrew installed: brew install dos2unix

Note that the commands do 3 things:

  • all CRLF instances are converted to LF
  • remaining, stray CR-only instances are removed
  • a trailing LF is added to any single-line text files without a newline

The commands create aux. file test-files.txt in the repo folder - remove it once conversion is done.

# Save a list of all *text* files in 'text-files.txt', for use below.
git ls-files -z | xargs -0 file | awk -F: '$2 ~ /[Tt]ext/' > text-files.txt

#  The following converts all CR-containing files to LF-only newlines.
awk -F: '$2 ~ / CR/ { print $1 }' text-files.txt | xargs -I{} dos2unix {}

# Now remove any remaining *isolated CR* instances from those files that had them.:
awk -F: '/ CR[ ,]/ {print $1}' text-files.txt | xargs -I{} sh -c 'tr -d "\r" <"$1" >"$1.$$" && mv "$1.$$" "$1"' - {}

# Those text files that have NO newlines at all (single-line files with no trailing newline):
# Append LF to them
awk -F: '$2 ~ / no line terminators/ { print $1 }' text-files.txt | xargs -I{} sh -c 'printf "\n" >> "$1"' - {}

@mklement0
Copy link
Contributor Author

@SteveL-MSFT

After looking into this some more:

Conversely, I would be fine having all of them CRLF since git will convert to just LF on Linux/Mac

That's actually not how it works:
Since git is LF-centric, CRLF files stored as such in the repo are considered NON-text files, and as such the automatic newline-conversion configuration settings do not apply.

As non-text ("binary") files, they are checked-out and committed as-is, both on Windows and Unix, which means that you'd get CRLF newlines in your working tree on Unix (which is clearly the wrong thing to do).

As stated, you must go out of your way - either intentionally or accidentally - to even succeed at committing a CRLF file with CRLFs preserved (assign the -text (non-text) attribute explicitly to a text file in .gitattributes, or, in the absence of attributes, commit a CRLF file while core.autocrlf is not set to true).

Conversely, anything that git considers a text file invariably has its newlines normalized to LF on staging (and therefore on committing) - even if a working-tree file's newline style doesn't match the expected one with respect to git's settings.

@SteveL-MSFT
Copy link
Member

@mklement0 thanks for the insights. I'll take care of this based on the commands you provided and will submit as new PR.

@mklement0
Copy link
Contributor Author

mklement0 commented Sep 11, 2017

Thanks, @SteveL-MSFT.

Below are commands for assisting with normalizing the encoding of all files to BOM-less UTF-8.

Even if you decide to go with ASCII, after all (boo, hiss), you can use the normalized files as a consistent starting point for the to-ASCII conversion; utility print-nonascii can help with locating the lines that require changes (see below).

Note the source-code comments re caveats.

I'm closing this PR, and I'm leaving this to you and your team to complete.

bash shell commands for normalizing all encodings to BOM-less UTF-8

  • You should be able to run this on either macOS or Linux (I ran it on macOS 10.12.6), from the PS repo folder.

  • The only prerequisite is utility print-nonascii; install it as follows, via npm (comes with Node.js):

    • npm i -g print-nonascii
    • For manual installation, see the repo.
  • The commands rely on aux. file test-files.txt, as created by the newline-normalization commands above.

# CAVEAT: RTF, XML, HTML and other files that specify the encoding *as part of the file* should be EXCLUDED below,
#         because re-encoding could leave such files in an inconsistent state.

# Convert all with-BOM UTF-8 files to BOM-less UTF-8
awk -F: '$2 ~ /UTF-8 Unicode \(with BOM\)/ { print $1 }' text-files.txt | xargs -I{} bash -c 'tail -c +4 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

# Convert all Windows-1252 / ISO-8859 files to BOM-less UTF-8
awk -F: '$2 ~ /ISO-8859/ { print $1 }' text-files.txt | xargs -I{} bash -c 'iconv -f CP1252 -t UTF-8 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

# There are Windows-1252 characters that are missing from ISO-8859-1 such as "curly quotes".
# `file` only knows about ISO-8859-1, so it reports files with such characters as "Non-ISO".
# We *assume* that such files are Windows-1252 files, and convert them to BOM-less UTF-8 based on that assumption.
# NOTE: If the assumption doesn't hold, characters may be *lost*, so this command may have to be
#       undone.
awk -F: '$2 ~ /Non-ISO/ { print $1 }' text-files.txt | xargs -I{} bash -c 'iconv -f CP1252 -t UTF-8 "$1" > "$1.$$" && mv "$1.$$" "$1"' - {}

After the conversion, use the following print-nonascii command to print lines with non-ASCII characters from all text files and visually verify that the re-encoding was successful in all cases.

Each line with one or more non-ASCII character is printed twice, preceded by the line number:

  • once as-is

  • again with PowerShell-style Unicode escapes; e.g., `u{2013} for (EN DASH)

Note that there are non-obvious non-ASCII characters such as punctuation that outwardly resembles ASCII punctuation, such as the EN DASH example;
similarly, there are instances of non-ASCII quotes.

# Print lines that contain non-ASCII chars. across all text files.
awk -F: '{ print $1 }' text-files.txt | tr '\n' '\0' | xargs -0 print-nonascii -n --psh -r

The above yields, for instance (only a few lines shown):

###	CHANGELOG.md
100:* Fix an issue in implicit remoting where restricted sessions couldn't use `Get-FormatData –PowerShellVersion`. (#4222)
100:* Fix an issue in implicit remoting where restricted sessions couldn't use `Get-FormatData `u{2013}PowerShellVersion`. (#4222)
###	ThirdPartyNotices.txt
5:The software is based on or incorporates material from the projects listed below (collectively, “Third Party Code”).  Microsoft is not the original author of the Third Party Code.  The original copyright notice and license, under which Microsoft received such Third Party Code, are set forth below. Microsoft reserves all rights not expressly granted herein, whether by implication, estoppel or otherwise.
5:The software is based on or incorporates material from the projects listed below (collectively, `u{201c}Third Party Code`u{201d}).  Microsoft is not the original author of the Third Party Code.  The original copyright notice and license, under which Microsoft received such Third Party Code, are set forth below. Microsoft reserves all rights not expressly granted herein, whether by implication, estoppel or otherwise.
###	assets/AppImageThirdPartyNotices.txt
8:Copyright © 1991-2016 Unicode, Inc. All rights reserved.
8:Copyright `u{a9} 1991-2016 Unicode, Inc. All rights reserved.
...

@mklement0 mklement0 closed this Sep 11, 2017
@mklement0 mklement0 deleted the source-encoding-newline-fix branch September 11, 2017 04:19
@SteveL-MSFT
Copy link
Member

I'll try to take care of this on the weekend so there isn't so many PRs causing merge conflicts.

@rkeithhill
Copy link
Collaborator

@SteveL-MSFT What do you think about the suggested change to .gitattributes to enforce the desired line-endings for this repo?

@mklement0
Copy link
Contributor Author

Additionally, let me reiterate the need for (at least) a server-side hook that rejects improperly encoded text files (except deliberately differently encoded test files, if applicable).

@SteveL-MSFT
Copy link
Member

I'm prepping a PR to at least change the .gitattributes to enforce it so we'll slowly get there. I'll bulk fix the rest after 6.0.0. Also doing some other fixes in this PR I'm submitting soon (like copyright symbol)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants