-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 #4761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements #4752, and more: line endings standardized, converted all files to BOM-less UTF-8 #4761
Conversation
|
@mklement0, |
|
Please fix the test failures. Also, submit with |
…zed, converted all file to BOM-less UTF-8
|
I get the re-encoding work but why change the line-endings to LF? Why not let autocrlf continue to do its magic? If you do go through with this change, then you should update .gitattributes to enforce the EOL seq e.g |
|
Based on discussion with @lzybkr we both agree that our files should just be ascii. Any characters outside of that range should be escaped appropriately. Also, agree with @rkeithhill that line endings shouldn't change as git will convert as necessary. |
|
Re ASCII: I can do that, however:
Again, going with the already agreed-upon BOM-less UTF-8 as the implied default encoding seems like the better choice to me. Re newlines: Of the 3053 text files currently in the repo, only 366 are currently CRLF; the remaining 2687 already are LF-only files. What benefit is there in allowing this mix of newline styles? As long as we make the tests aware of platform-typical checkout behavior, we can continue to let @rkeithhill: Is adding that one line to |
|
@mklement0 for comments and markdown files, I think using (C) or (R) is sufficient, the escape sequence was specifically about code I have noticed that sometimes after I push a change, GitHub doesn't display the diff correctly because of line endings (I usually end up resaving it as LF which seems to fix the problem so I suspect this is the issue). So if converting the rest to have consistency and potentially resolves this problem, I'm ok with having everything LF on the repo side. |
SteveL-MSFT
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File encoding should be ASCII. Ok with line endings being LF.
|
I'm not OK with LF - we should use "native" line endings where most of the developers are. Today and for the foreseeable future, that skews towards Windows. If you are finding consistency issues though, I'm happy to see that fixed, just not moving to LF. |
|
I'll list the commands that I ran on macOS 10.12.6 from
# All text files that are in the repo.
$ git ls-files -z | xargs -0 file | awk -F: '$2 ~ /text/' > text-files.txt; wc -l < text-files.txt
2080
# Those that have CRLF line endings.
$ awk -F: '$2 ~ /CRLF/' text-files.txt | wc -l
107
# Those that are UTF-8 with a BOM - whether they contain actual non-ASCII chars. or not.
$ awk -F: '$2 ~ /UTF-8 Unicode \(with BOM\)/' text-files.txt | wc -l
400
# Those that are actually UTF-8 encoded, despite not having a BOM:
# They contain UTF-8-encoded non-ASCII characters.
# (All BOM-less files that have ASCII-only characters are reported as ASCII by `file`)
$ awk -F: '$2 ~ /UTF-8 Unicode text/' text-files.txt | wc -l
19
# Those that are Windows-1252 encoded (ISO-8859):
$ awk -F: '$2 ~ /ISO-8859/' text-files.txt | wc -l
12
# Those that contain invalid characters.
# They have no BOM and contain characters that are neither Windows-1252 nor UTF-8-encoded:
$ awk -F: '$2 ~ /Non-ISO/' text-files.txt | wc -l
4As you can see, there are a number of inconsistencies. I think it makes sense to normalize the newlines to LF in the repo. My understanding is that if that's all we do, the Re conversion to ASCII; the non-ASCII characters in the current text files fall into the following categories:
Should the files that do need to retain non-ASCII characters be normalized with respect to encoding and BOM? I assume UTF-8 is the way to go, but consistently with or consistently without BOM (my vote)? Please let me know exactly how you want to resolve all these inconsistencies before I take another stab at it. |
|
As an aside: Many Correct: Incorrect (e.g., |
|
@lzybkr git converts on the client the native line endings so on Windows, you end up with CRLF, so I'm not sure why you are against having just LF for the files in the repo. Conversely, I would be fine having all of them CRLF since git will convert to just LF on Linux/Mac, but it's an extra character for every line. |
|
One additional reason to not change line endings - |
|
My original reason was that git stores the head for fast checkouts. By storing LF - git is forced to convert on checkout on Windows but not elsewhere - making checkout slower where more development is done. If more development was done on Linux/Mac - then there might be an argument for changing the line endings if you didn't mind having basically every line of code appear to be authored by the same person. Note that we can change the author for such a commit - we've used |
|
Oh - one more reason - git automatically converts only if you set git up correctly. Not everybody does that, and it's a pain when they don't. I speak from experience here - it wastes maintainers time helping contributors get set up correctly, fix their commits, etc. So now that I've paged back in a bunch of reasons why I'm opposed to this PR, I'll actually strongly opposed to this PR as is. |
|
@mklement0 I see different stats for EOL types: Also, to prevent individuals with varying autocrlf settings, perhaps the .gitattributes file should enforce whatever decision is made here: https://help.github.com/articles/dealing-with-line-endings/#per-repository-settings |
|
Here are the 100 files stored in the repo (index) as CRLF: |
|
I suggest split the super large PR. May start with fixing copyright, trademark, and registered trademark symbols in comments and md files? |
|
@rkeithhill: Great analysis (and nice combination of git and PowerShell) - good to know about The reasons for the CRLF-file-count discrepancy are:
|
With git's default configuration (
In other words: What you're strongly opposing is:
@rkeithhill's suggestion is the way to go for enforcing platform-appropriate check-out and check-in behavior as part of the repo rather than relying on contributors' global Specifically, this means adding the following line to the extant * text=autoTo quote from
So, with respect to newlines, I think the way to go is:
@iSazonov's suggestion to split this - definitely unusable in it current form - PR into multiple PRs makes sense, and I'm happy to tackle that, but I need consensus and clear guidance, not just on the newline issue, but also on the specific Unicode-character/encoding-normalization questions raised above. |
|
If we have right git config in the repo users will get it after fork in local workspace - so I vote for LF in the repo. We can fix some things out of the PR and we should do so. I ask to confirm that we should replace copyrights (and so on) with ASCII chars if we want ASCII files. |
|
Primarily I object to the need to review hundreds of files. I can't in good conscience merge a change like this because I'm forced to review every single line or risk taking an inadvertent or possibly malicious change. At a minimum, I would ask that you provide the scripts that made the changes and a PowerShell team member make the actual changes - that would address my concerns about needing to review every line. It's much easier to review the BOM change - and I see the value in that as well, though we disagree on the approach to removing Unicode characters. I can repeat my objections - people do use tools that mess up those characters, and by sticking with ASCII (and enforcing it with tools), I can let my guard down some. |
|
We could add the checks in Meta test and git hooks. |
|
Re newlines: Understood; I'll list the shell commands I used in a separate comment. Re encoding / Unicode characters: Even if we go with the ASCII approach, we need a way to enforce the desired encoding going forward either way, so as to avoid sliding back into a confusion of encodings and newline styles over time. As @iSazonov mentions, server-side Git hooks and possibly checks during CI testing can give us that. With the ability to enforce the desired encoding / newlines styles, we may as well enforce BOM-less UTF-8, which radically simplifies things going forward, just like it will for end-users, once PS Core speaks BOM-less UTF-8 natively. As an aside: It's also worth considering supplying a custom While there is no fully automatic way to install them, a one-time setup script that must be run every time a local repo is set up would suffice - and, if feasible, a server-side check failing could recommend running that setup script in its error message. |
|
|
After looking into this some more:
That's actually not how it works: As non-text ("binary") files, they are checked-out and committed as-is, both on Windows and Unix, which means that you'd get CRLF newlines in your working tree on Unix (which is clearly the wrong thing to do). As stated, you must go out of your way - either intentionally or accidentally - to even succeed at committing a CRLF file with CRLFs preserved (assign the Conversely, anything that git considers a |
|
@mklement0 thanks for the insights. I'll take care of this based on the commands you provided and will submit as new PR. |
|
Thanks, @SteveL-MSFT. Below are commands for assisting with normalizing the encoding of all files to BOM-less UTF-8. Even if you decide to go with ASCII, after all (boo, hiss), you can use the normalized files as a consistent starting point for the to-ASCII conversion; utility Note the source-code comments re caveats. I'm closing this PR, and I'm leaving this to you and your team to complete.
|
|
I'll try to take care of this on the weekend so there isn't so many PRs causing merge conflicts. |
|
@SteveL-MSFT What do you think about the suggested change to |
|
Additionally, let me reiterate the need for (at least) a server-side hook that rejects improperly encoded text files (except deliberately differently encoded test files, if applicable). |
|
I'm prepping a PR to at least change the .gitattributes to enforce it so we'll slowly get there. I'll bulk fix the rest after 6.0.0. Also doing some other fixes in this PR I'm submitting soon (like copyright symbol) |
Implements #4752, though not exactly as asked:
This PR does not replace the copyright/trademark symbols with their equivalent Unicode escape sequences, but instead - in line with BOM-less UTF-8 becoming the default - performs the following tasks:
all CRLF newlines were converted to LF-only newlines
all (by definition BOM-less) Windows-1252-encoded files were converted to BOM-less UTF-8.
all UTF-8 files with BOM had their BOM removed.
a few files contained invalid characters; they were replaced/removed.
In other words: all files are now BOM-less UTF-8 files with LF-only line endings, and, ideally, only such files should be added in the future.