Skip to content

Conversation

@iSazonov
Copy link
Contributor

No description provided.

Copy link
Contributor

@vexx32 vexx32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up! Couple small nits.

Overall I think this is a good idea, but I'm not sure about the case-sensitive by default for most of the suggested values. It does match the c# underpinnings but it's also the reverse of PowerShell's default, where case-_in_sensitive is the assumption. I don't know about suddenly switching that around with these parameters.

vexx32 and others added 2 commits April 24, 2019 16:39
…nal.md

Co-Authored-By: iSazonov <darpa@yandex.ru>
…nal.md

Co-Authored-By: iSazonov <darpa@yandex.ru>
@iSazonov
Copy link
Contributor Author

I'm not sure about the case-sensitive by default for most of the suggested values

@vexx32 The RFC doesn't suggest to change the case defaults.

@vexx32
Copy link
Contributor

vexx32 commented Apr 24, 2019

Ah, you're right. That was the alternate proposal, I misread.

I'm good with the primary proposal. 👍

@iSazonov
Copy link
Contributor Author

Ah, you're right. That was the alternate proposal, I misread.

Alternative proposal is not a breaking change. It only keep consistency - one API for script and binary.

@iSazonov
Copy link
Contributor Author

@SteveL-MSFT @vexx32 @mklement0 Could you please ping community to discuss the RFC?
I would like to quickly begin the implementation of what I have been trying to do for a long time - to move towards the speed of ripgrep

@mklement0
Copy link
Contributor

@iSazonov:

(As an aside: I personally cannot ping any community - I have no blog and have only a handful of Twitter followers; I rarely tweet myself.)

As currently written, your RFC doesn't provide adequate motivation; there's more to be included from PowerShell/PowerShell#9348.

As currently written, the RFC reads like a passive-aggressive resistance to the decision made in PowerShell/PowerShell#9348 (comment):

Your desire for unconditional consistency between PowerShell and CoreFx / C# APIs is misguided, and such a discussion doesn't belong in this RFC.
In short: PowerShell intentionally deviates from CoreFx defaults at times, notably with respect to case-sensitivity. PowerShell end users are not necessarily C# developers. While it makes sense to avoid arbitrary deviations, fundamental differences in philosophy must be honored.

@joeyaiello
Copy link
Contributor

@PowerShell/powershell-committee had a limited review on this today (only 3 out of 6 of us in attendance). A couple notes:

  • It would be super useful to have a couple examples of how this would work. (As an aside, @JamesWTruher is adding an examples header to the template so we always have this going forward.)
  • A little more context on the motivation would be useful as well. For example, if ripgrep-like perf is part of your motivation, explain how this would help achieve that
  • @JamesWTruher made the case (in agreement with the proposal) that we're targeting PowerShell customers who don't understand the existing C# APIs, and won't benefit from having them tacked onto the cmdlets. Developers already handling the APIs will be able to understand the nuance assuming we doc it well.
  • Overall, we agree with the proposal, but we need to wait to have quorum to officially approve it (and it would be helpful to have the updates requested).

@iSazonov
Copy link
Contributor Author

Before adding commits, I want to discuss it here.

Several cmdlets perform operations based solely on the current culture:

  • Compare-Object
  • Group-Object
  • Sort-Object
  • Select-Object
  • Select-String

while the full set of capabilities should include operations with an indication of a specific culture, and not only linguistic operations but also non-linguistic operations.

There are areas of scenarios that rather require non-linguistic operations (such as searching in logs).
Non-linguistic operations are also much faster, which is important for processing large files (and modern logs are usually very large).

Linguistic operations include CurrentCulture, specific Culture and InvariantCulture. Non-linguistic is Ordinal.
A special case is OrdinalIgnoreCase. .Net Core on Windows uses P/Invoke, and the implementation is closed. At the same time, .Net Core on Unix uses its own implementation
https://github.com/dotnet/coreclr/blob/068aa8bbb7a3f303c809775561d70c1875149853/src/corefx/System.Globalization.Native/pal_collation.c#L707-L743

We can also see such a comment
https://github.com/dotnet/coreclr/blob/068aa8bbb7a3f303c809775561d70c1875149853/src/corefx/System.Globalization.Native/pal_collation.c#L544-L546

    // On Windows with InvariantCulture, the LATIN SMALL LETTER DOTLESS I (U+0131)
    // capitalizes to itself, whereas with ICU it capitalizes to LATIN CAPITAL LETTER I (U+0049).
    // We special case it to match the Windows invariant behavior.

(Another confirmation
https://github.com/dotnet/coreclr/blob/c5a44f58952c5014f5e1c25b667dca3901fd84a7/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs#L451-L459
)

This suggests that OrdinalIgnoreCase is based on InvariantCulture (linguistic)

.Net Core actively uses local performance optimization for ASCII characters. Measurements show that OrdinalIgnoreCase for ASCII characters is twice as fast as for other characters (which also confirms the use of InvariantCulture).

Such optimization leads to the discrimination of languages ​​other than English that looks unjustified.
We might expect OrdinalIgnoreCase to be non-linguistic as well as Ordinal, but it is not. It is expected to be able to non-linguistic option for IgnoreCase whose performance is comparable to Ordinal.

To achieve this, the CoreFXLab repository package is implemented with support for Simple Case Folding (SCF) based on Unicode standard.
The SCF standard is also used in the ripgrep utility, which ignores culture settings and always uses SCF.
Using SFC in the ripgrep utility is one of the ways to get a quick search. The implementation of Regex in CoreFX has full capabilities but is very slow. Based on the SFC, we could implement a quick search option for a simple Regex subset here or in CoreFX. The Regex subset could be involved with existing SimpleMatch parameter or with new FastMatch parameter.

It should be noted that Unicode support in Core is based on the implementation of the underlying OS, which sometimes implies different behavior. With SFC, we will always have the same behavior on all platforms. (This is by the way important for PowerShell identifiers, but it’s not in the scope of this RFC) (Also there was an issue for HashSet fixed with InvariantCulture but it seems SFC is more suitable.)

So the suggestion is to

  • add the ability to specify a specific culture with values ​​like en-Us, ru-RU and others
  • add the ability to specify special values
    • CurrentCulture
    • InvariantCulture
  • add the ability to specify traditional Ordinal (non-lingustic) / OrdinalIgnoreCase (lingustic) in the form that we will approve in this discussion
    (The presence of the CaseSensetive parameter is even more confusing, since the Ordinal will be interpreted either as lingustic, then as non-lingustic)
  • add the ability to specify a non-lingustic option that is SimpleCaseFolding
  • implement fast Regex subset based on SFC with existing SimpleMatch parameter or with new FastMatch parameter
  • add the ability to specify AsByteStream - see description below

Suggestions for implementation:

  1. pack all this in one parameter Culture (the name is controversial as seen below)
    There are two problems
    • shuffling of crop names (en-Us) with special values, which makes it difficult for users to discover (although we could only implement the completer with special values)
    • it is not clear how to combine the dual Ordinal (non-lingustic) / OrdinalIgnoreCase (lingustic) with the name Culture which implies lingustic, and also what names should be used if we do not want to overlap with C#
Select-String -Culture ru-RU -CaseSensetive      # lingustic
Select-String -Culture ru-RU                     # lingustic
Select-String -Culture Ordinal -CaseSensetive    # non-lingustic
Select-String -Culture Ordinal                   # lingustic - OrdinalIgnoreCase
Select-String -Culture Invariant -CaseSensetive  # lingustic
Select-String -Culture Invariant                 # lingustic - InvariantCultureIgnoreCase
Select-String -Culture SimpleCaseFolding         # non-lingustic
  1. use two parameters
    • Culture for en-Us type names
    • Comparison for special names (the problem with the choice of names is the same)
Select-String -Culture ru-RU -CaseSensetive      # lingustic as expected for Culture term
Select-String -Culture ru-RU                     # lingustic

Select-String -Comparison Ordinal -CaseSensetive # non-lingustic
Select-String -Comparison Ordinal                # lingustic - OrdinalIgnoreCase
Select-String -Comparison Invariant -CaseSensetive  # lingustic
Select-String -Comparison Invariant              # lingustic - InvariantCultureIgnoreCase
Select-String -Comparison SimpleCaseFolding      # non-lingustic
  1. Regex subset based on SCF
    • implement "^","$",".","*","+","\d","\r","\n","\t","\w" without backtracking
Select-String -SimpleMatch "^Error.+count"       # Current default is CurrentCulture - we'll have to change to SCF
Select-String -FastMatch "^Error.+count"         # Based on SCF

Select-String -SimpleMatch "^Error.+count" -Match "<Full Regex>"
Select-String -FastMatch "^Error (.+) count" -Match "<Full Regex>"     # First do fast search a candidate string in large file
                                                                       # then parse it with full-featured Regex

A separate problem is that the grep / ripgrep utilities have an option to interpret the input stream from the file as byte ("binary" in their terms) which can be useful. FileSystemProvider offers us the AsByteStream parameter, which can also be useful here.
This parameter has been separated from the Encoding parameter. This parameter is used in Select-String but not in other above mentioned cmdlets. Perhaps we should unify the cmdlets to use Encoding and AsByteStream wherever Path / LiteralPath parameters are used.
This is also an example of the fact that we did not mix encodings and special values. Perhaps we should also deal with cultures and use two different parameters.

Select-String -Path source.txt -AsByteStream -SimpleMatch "qwerty"
Select-String -Path source.txt -Encoding Utf8 -SimpleMatch "qwerty"

@joeyaiello
Copy link
Contributor

@PowerShell/powershell-committee reviewed your latest reply. First, really appreciate the additional examples and exposition. Feel free to incorporate them into the RFC when you're ready.

Right now, we're leaning towards your proposal 2 with some notes:

  • We would have to create a new enum type for -Comparison that includes -Ordinal, -Invariant, and -SimpelCaseFolding to give us tab completion
    • Ideally, we shouldn't add SimpleCaseFolding until we get it for free when it's added to .NET Core
  • -Culture can be a distinct parameter that uses a different tab completer for cultures
  • This also means -Culture and -Comparison should not be included in the same parameter set
  • I know you don't necessarily agree with our approach of -Comparison/Culture + -CaseSensitive, but we ultimately rested on @mklement0 's argument from Enhance some cmdlets with Culture and Comparison parameters PowerShell#9348 about the the awkwardness of specifying comparisons like -CurrentCulture that our case-sensitive when PowerShell typically defaults to case-insensitive:
  • Note the double awkwardness of -Culture ru-RU -Comparison CurrentCulture:
    • The word current can seem contradictory with specifying a specific culture.
    • From a PowerShell mindset, nothing indicates case-sensitivity.

@iSazonov
Copy link
Contributor Author

iSazonov commented Jul 2, 2019

I know you don't necessarily agree

I agree with best design ( what is best? :-) ). And it depends on how we resolve:

Ideally, we shouldn't add SimpleCaseFolding until we get it for free when it's added to .NET Core

It's a vicious circle. :-) Core team want to see use-cases before. We want to see the API in .Net Core before. @joeyaiello Could you please discuss the SimpleCaseFolding design and incorporating in .Net Core with Core team? After that, a way will be opened without danger of making breaking changes in future.

@iSazonov
Copy link
Contributor Author

iSazonov commented Jul 2, 2019

@mklement0 Could you please take another look?

@joeyaiello
Copy link
Contributor

@PowerShell/powershell-committee reviewed some more:

  • I hear you loud and clear on the SimpleCaseFolding stuff. I can't promise a super timely discussion with them, but I'll do my best to get in touch with someone. Found the issue here and it looks like they changed the milestone from Future to 5.0, so maybe we're already in a good place there. 🤞
  • We're unclear about the relationship between the Culture/Comparison work (which we're prepared to accept minus the SimpleCaseFolding) and the ripgrep/-FastMatch work. They feel like different concepts even if they live in similar aspects of the code. If you can please split that aspect of the RFC out into another RFC, we can take the rest of it now.
  • I know you've been working heavily in this code, do you have any kind of partial implementation already? Especially if it can provide any kind of performance data today?

@iSazonov
Copy link
Contributor Author

iSazonov commented Aug 13, 2019

I hear you loud and clear on the SimpleCaseFolding stuff.

Thanks! My main concern - if .Net Core will have another API design we will fall in a breaking change. If you could speed up the start of this design process it would be very good. Of cause we can postpone SimpleCaseFolding implementation without removing from the RFC. My current thoughts is to implement Culture, then Comparison, then FastMatch.

We're unclear about the relationship between the Culture/Comparison work (which we're prepared to accept minus the SimpleCaseFolding) and the ripgrep/-FastMatch work.

This was added only to show the full picture. It is not in the RFC and we could have another RFC for this proposal.

do you have any kind of partial implementation already?

If you ask about Select-String I have no prototype. I have two prototypes - SimpleCaseFolding and Regex0.

My vision today is that we could split the work on 2 main parts:

  1. Implement Culture and Comparison parameters for some mentioned above cmdlets (by one to simplify PRs)
  2. Performance optimizations that is others 3 parts:
  • using SimpleCaseFolding - that gives direct perf win 2-3x for non-English languages (Perhaps for English too but it requires more investigations and refactoring my SimpleCaseFolding implementation by experts. My last optimizations was for .Net Core 3 Preview2)
  • using new light-weight Regex implementation. We could use my prototype https://github.com/iSazonov/Regex0. It is faster in 4-5 times than full featured .Net Core.
  • excluding Unicode transcoding. Today we always transcode to Utf16 that is extra step for Utf8 files.

I hope now the RFC ready for final review.

@iSazonov
Copy link
Contributor Author

@joeyaiello @SteveL-MSFT The RFC is ready for final review.
https://github.com/dotnet/corefx/issues/41333 .Net Core 5.0 will have Simple Case Folding, so I removed all references to SCF from the RFC.

@iSazonov iSazonov changed the title Enhance some cmdlets with Culture and Comparison parameters Enhance some cmdlets with Culture parameter Oct 30, 2019
##### Samples

```powershell
Select-String -Culture ru-RU -CaseSensitive # lingustic as expected for Culture term
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some of the examples you already have in the code PR tests as examples here? Specifically looking to have both inputs and expected outputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@joeyaiello
Copy link
Contributor

Thanks for the updates, @iSazonov! We'll put this one on our list to review

@daxian-dbw daxian-dbw self-requested a review March 3, 2020 21:09
@joeyaiello
Copy link
Contributor

@iSazonov given our desire to move towards the 2nd proposal you floated in this comment (and it looks like you're in agreement), the next step here would be for you to update the RFC to reflect that proposal (if you're still interested, of course).

Unfortunately, we were not able to move the ball forward with the .NET team on the SimpleCaseFolding changes, but I think that there are still benefits here with the -Comparison parameter even if we don't have the SCF performance improvements.

We'd also want to confirm that the default, parameter-less invocation of Select-String is not changing from the behavior today.

@iSazonov
Copy link
Contributor Author

Unfortunately, we were not able to move the ball forward with the .NET team on the SimpleCaseFolding changes, but I think that there are still benefits here with the -Comparison parameter even if we don't have the SCF performance improvements.

Currently .Net 6.0 was moved to ICU and if I understand correctly, now OrdinalIgnoreCase works as SimpleCaseFolding. But they lost a lot of performance because of p/invokes. They are going to invest in ICU and benefit from PGO.

(if you're still interested, of course)

You want working groups, don't you? Let's leave it to them.

@joeyaiello joeyaiello added the WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module label Jul 27, 2021
Copy link
Member

@SteveL-MSFT SteveL-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, WG is discussing this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Review - Waiting on Author WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants