Skip to content

Conversation

@vexx32
Copy link
Collaborator

@vexx32 vexx32 commented Aug 13, 2018

PR Summary

Implements 'u' and 's' token case in the Tokenizer.cs and CharTraits.cs parsing which allows the use of uint or short literals as numbers with a 'u' and/or 's' suffix:

PS> $Value = 108u
PS> $Value.GetType().FullName
System.UInt32

PS> $Short = 105.5us
PS> $Short
106
PS> $Short.GetType().FullName
System.UInt16

I also tidied up some of the code in the vicinity to inline some of the very temporary variable definitions.

Comprehensive Summary:

  • Adds type suffixes for u and s - uint and short, respectively.
  • Adds some logic to test whether UInt32 or UInt64 is the appropriate return type by comparing the multiplied value against UIUnt32.MaxValue when neither l nor s is supplied.
  • Adds [short], [ushort], [uint], and [ulong] type accelerators to help out people coming from C# or other languages. Mapped to int16, uint16, uint32, and uint64, respectively.
  • Maintains parity with existing l / L suffix, where it will also perform rounding functions when applied to a real literal while also ensuring it still returns an appropriate uint or ulong value: 105.5u => 106
  • Defines a new flags enum NumberSuffixFlags that takes care of one-or-more numerical type suffixes.
  • Can explicitly parse ul-suffixed numeric values as ulong and us-suffixed values as ushort.
  • Explicitly specified length types for short and long as well as declared unsigned types will error out and register as an invalid numeric literal if their values after multiplication exceeds the type contraints. This is intentional; if the user specifies a value that cannot be satisfied e.g., 500sgb (500 short, gigabyte multiplier) then the parser should not be expected to figure out what they meant.
  • Converts strNum parameter of TryGetNumberValue to ReadOnlySpan<char> to hopefully get a little better performance as it flies through all the TryParses.

PR Checklist

(not sure on all of those points; never worked on this repo before; feel free to let me know about stuff I'm missing)

@msftclas
Copy link

msftclas commented Aug 13, 2018

CLA assistant check
All CLA requirements met.

Ignoring things that aren't as a result of these changes
@iSazonov
Copy link
Collaborator

@vexx32 Please add tests.

@iSazonov
Copy link
Collaborator

@SteveL-MSFT Should we get PowerShellCommittee approvement?

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 13, 2018

I'll get on that this evening. Just realised there's one more case it should handle; native ulong input when there's no multiplier.

You guys think it's worth looking at adding short type accelerators for [uint] and/or [ulong] as well to have parity with regular int accelerators?

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 13, 2018

OK Forgive me for not really.. being aware here.

Are we talking xUnit or Pester tests in this instance? The latter I can do easily, the former... well, I'll have to do some reading!

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 13, 2018

@iSazonov hopefully these tests are workable. I tried to place them in as sensible a place as I could find; with the other parser notation tests.

I've also modified the accelerator checks to accept the new [uint] and [ulong] type accelerators. I think.

@SteveL-MSFT SteveL-MSFT added the Review - Committee The PR/Issue needs a review from the PowerShell Committee label Aug 13, 2018
@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 13, 2018

Okay, I think I properly fixed both the tests and my ulong parsing. 😊

return false;
}

if (suffix == 'u' || suffix == 'U')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this work the same?

(9.9l).GetType().Fullname
(9.9u).GetType().Fullname

Suffix 'L' is in other code path - should we process 'U' in the same way?

Copy link
Collaborator Author

@vexx32 vexx32 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly confused why l and d are in separate code paths at all, honestly.

I would be more inclined to put it with d however as there is a pretty fundamental difference from int to uint which also means the actual bounds of the numerals are completely different to the int ranges.

Copy link
Collaborator Author

@vexx32 vexx32 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But... My inclination for that would almost be that it would be better for it to either throw or produce an unsigned double, if there is such a thing. (afaik there is not)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, upon reviewing the code paths for l suffix... there are two of them. It seems a bit like there's a double-up on code paths there.

I'm not sure how the real boolean is determined, but l suffix has a code path inside the double parsing segment for that as well as one outside and after it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Been thinking on this, and I think that it would probably be best to maintain parity with the existing L token suffix.

Adding a small branch for that code path in the if (real){} block as well, mimicking the doubling up the long parsing has. (I do wish there was a good way to consolidate them, but it looks like we're either gonna be doubling real checks for every parse or just doubling a couple of parse checks).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And done!

Had a slip up where I forgot my u token checks were before the check for real which caused a slight issue for a moment.

Also added a test or two to check the round-on-parse behaviour.

{
if (UInt32.TryParse(strNum, style, NumberFormatInfo.InvariantInfo, out uint u))
{
ulong testresult = u * (ulong)multiplier;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the cast is unneeded.

Copy link
Collaborator Author

@vexx32 vexx32 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, it appears to convert to int32 or int64. From what I can tell, uint * int => int

Edit: Right, of course, that was before I put in the intermediary variable, yeah the cast shouldn't be needed.

Copy link
Collaborator Author

@vexx32 vexx32 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So take 2:
No, the cast is necessary. C# does not permit implicit casts from long to ulong. So either I have to cast after the multiplication as a whole, or before the multiplication. This way is less parentheses.

if ( !$IsWindows )
{
$totalAccelerators = 94
$totalAccelerators = 97
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add three accelerators but two tests.

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 14, 2018

Appreciate the comments so far!

Okay, so updated the tests to match everything properly, and put in [ushort] for uint16 as well. I'm not sure of the prevalence of (U)int16 usage in PS, but I don't think making them a little more accessible to those coming from other languages is at all a bad idea. 😁

vexx32 added 2 commits August 14, 2018 07:48
Multiplier is a signed value (for no apparent reason)
As a result it has to be cast to unsigned. 
C# does not implicitly cast signed to unsigned values.
Protip: don't try to edit files from your phone ;)
@vexx32 vexx32 changed the title Adds 'u' suffix for UInt32 or UInt64 literal support Adds u suffix for UInt32 / UInt64 numeric literal support Aug 14, 2018
@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 18, 2018

Currently (according to the sole remaining failing test) we're expecting hex literals to automatically read s UInt32 if they're over Int32 size range.... or something.

I think I can mimic that behaviour, but the question is -- should we? And if we do, then I can do a similar thing for UInt64 hex literals as well.

It's already in memory there as bigint anyway, may as well use it
@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 19, 2018

OK So essentially if you have a hex value that is a multiple of 8 digits it expects a sign bit and adheres to it. Otherwise it assumes the value is positive when parsing (if you have a negative sign before the hex literal it is still respected and flips the final value).

It doesn't really make... sense... (in my opinion) for the value to go into doubles territory -- that's kind of against the point of using hex (unit precision) so since we already parse as bigint in this region (and always have, I've just kind of rolled it all together here) I figured if you're asking for a hex literal that big you might actually need it for some insane scenario and thought -- why not just yield the bigint value already there for the taking.

I'd love to hear your thoughts on it, though, as always! This is filtered through my own brain, so there may well be stuff in here that we should avoid doing for reasons I haven't thought of yet (yes bigint can cause OOM errors but the parser is already using it to check bounds anyway and always has; this won't introduce any new overflow possibilities that weren't already there just by virtue of entering stupidly huge decimals).

@iSazonov @rjmholt @lzybkr

image
image

And regarding that failing test on Get-Random:

PS C:\Users\Joel\source\repos\PSCore\PowerShell> get-random 0xfffffffffffffffffffffffffffffffff
5.04272284445114E+39

^^ that is why. 😄

@BrucePay
Copy link
Collaborator

@SteveL-MSFT I don't care about the back porting issue but I worry a great deal that so many changes to PowerShell's numeric processing will break backwards compatibility in a significant way. I think we need to revisit the impact of this change at next week's committee meeting.

@BrucePay BrucePay added Review - Committee The PR/Issue needs a review from the PowerShell Committee and removed Committee-Reviewed PS-Committee has reviewed this and made a decision labels Aug 19, 2018
@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 19, 2018

Fun fun~ Keep me posted 😄

FWIW, I haven't really changed all that much from the outside looking in; it parses standard things just the same. The only significantly different behaviour (other than the new type suffixes) is that it doesn't just give up on large hex literals anymore. :)

And... though I need to draw up proper benchmarks... a few brief tests seem to indicate it's a bit quicker than before, which is usually a good thing. 👍

In keeping with existing parse logic from the IntXX parsers
(sort of; Int parsers are a bit funky with this stuff)
This behaviour is a bit more consistent.
@iSazonov
Copy link
Collaborator

iSazonov commented Aug 20, 2018

As for sign:

  1. Currently PowerShell implement
-0x1A

-26
  1. Currently PowerShell CAN consider hex numerics as signed
>0xFFFFFFFF

-1

>(0xFFFFFFFF).Gettype()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     Int32                                    System.ValueType
  1. Current PowerShell behavior is to consider high bit as sign because target type always is signed - int32/64/long/double/BigInteger.

  2. If target type is unsigned (ushort,uint,ulong and so on) PowerShell MUST to consider high bit as value not sign
    The behavior is in CoreFX https://source.dot.net/#System.Private.CoreLib/shared/System/Number.Parsing.cs,890
    vs
    https://source.dot.net/#System.Runtime.Numerics/System/Numerics/BigNumber.cs,427

  3. Because we want add binary literals we should consider them here too. PowerShell must process binary literals in the same manner as hex literals.

@iSazonov
Copy link
Collaborator

@vexx32 The PR is very large (many comments and commits). I suggest split the PR to simplify.

  1. Make new PR with new prefixes and tests - it already was approved by PowerShell Committee and we can fast review and merge.
  2. Make new PR with NumberSuffixFlags in ScanNumberHelper (using switch) + replace Convert.ChangeType with new convert helper methods - the temporary PR without any fuctional changes can be fast reviewed and merged.

After that we can close current PR and open new one with cleaned commits based on PowerShell Committee conclusion - commits with new suffixes and perhaps changes related hex/binary sign.

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 20, 2018

RE: the hex behaviour -- I have most of that working, though I can see I need to add a case for Unsigned types to not treat the largest bit as the signing bit. NBD there.

As for the PR suggestions, yeah, I'll get on that. Shouldn't be too hard to pull off. 😄

EDIT: as per #7575 I have split off the type suffix and accelerator changes.

However, I'm not able to do the two-char type suffix without either converting suffix to string or the enum, so I opted to retain the changes that work with the enum specifically. Everything else remains the same.

Just for reference really.
[skipci]
@vexx32 vexx32 changed the title Add C#-style type accelerators and literal suffixes for short, ushort, uint, and ulong values Tokenizer Refactor Aug 21, 2018
@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 21, 2018

(renamed for clarity, I went a bit beyond the original intent here) ;)

@rjmholt rjmholt changed the title Tokenizer Refactor Add unsigned and short numeric literal and type accelerator support Aug 21, 2018
@rjmholt rjmholt changed the title Add unsigned and short numeric literal and type accelerator support Refactor the tokenizer Aug 21, 2018
@Jaykul
Copy link
Contributor

Jaykul commented Aug 23, 2018

This seems like a REALLY good candidate for that feature flag testing that's been discussed before.

Not on the sense that is simple to toggle, but in the sense that this big of a change ought to be tested ... a lot ... In such a way that people can initially opt in to the new parser, then eventually make it opt-out ... And if all goes well, remove the old code?

@vexx32
Copy link
Collaborator Author

vexx32 commented Sep 5, 2018

FWIW; After submitting #7575 I've been toying with possible improvements and have a much cleaner and more polished version of this code at https://github.com/vexx32/PowerShell/tree/Tokenizer/Refactor

It's not yet a PR as it is a branch off of the code in #7575 and includes those changes already.

@SteveL-MSFT
Copy link
Member

@PowerShell/powershell-committee re-reviewed this and is fine with the proposal (this is not a PR review approval)

@SteveL-MSFT SteveL-MSFT added Committee-Reviewed PS-Committee has reviewed this and made a decision and removed Review - Committee The PR/Issue needs a review from the PowerShell Committee labels Sep 5, 2018
else if (c == 'g' || c == 'G') { multiplier = 1024 * 1024 * 1024; }
else if (c == 't' || c == 'T') { multiplier = 1024L * 1024 * 1024 * 1024; }
else if (c == 'p' || c == 'P') { multiplier = 1024L * 1024 * 1024 * 1024 * 1024; }
switch (c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going from 5 lines to 22 lines of code is an improvement? 🙂
Theoretically a switch might be faster if the jit generates a jump table, but if performance were the concern, I'd use a statically initialized array and replace these 5 lines of code with:

multiplier = multipliers[c];

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, that's a good idea! I'll borrow that and submit it in the next PR. Feel free to review this further if you want, but be aware that 90% of the code here has been completely overhauled in the upcoming PR (and it has fancy tidbits with binary parsing and underscore support yay!) for y'all to have fun combing through as I re-learn my C# ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then maybe just close this PR? I'd rather not waste my time reviewing 90% throwaway code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that! :)

return true;
}

// Has double suffix but cannot be parsed as double.
Copy link
Contributor

@lzybkr lzybkr Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has decimal suffix so should not be promoted to double, an error should be reported instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha! That comment is incorrect. It wouldn't get promoted; that would fall through the the result = null; return false; statements.

I'll... quietly make sure I've caught that in my more recent work.... XD

@vexx32
Copy link
Collaborator Author

vexx32 commented Sep 6, 2018

Closing this for now so we don't waste anymore time here. The base of the stuff that was meant to happen here is in #7575, and the rest needs a great deal of clean up that will be in an upcoming PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Breaking-Change breaking change that may affect users Committee-Reviewed PS-Committee has reviewed this and made a decision

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants