Refactor the tokenizer #7509

vexx32 · 2018-08-13T04:09:51Z

PR Summary

Implements 'u' and 's' token case in the Tokenizer.cs and CharTraits.cs parsing which allows the use of uint or short literals as numbers with a 'u' and/or 's' suffix:

PS> $Value = 108u
PS> $Value.GetType().FullName
System.UInt32

PS> $Short = 105.5us
PS> $Short
106
PS> $Short.GetType().FullName
System.UInt16

I also tidied up some of the code in the vicinity to inline some of the very temporary variable definitions.

Comprehensive Summary:

Adds type suffixes for u and s - uint and short, respectively.
Adds some logic to test whether UInt32 or UInt64 is the appropriate return type by comparing the multiplied value against UIUnt32.MaxValue when neither l nor s is supplied.
Adds [short], [ushort], [uint], and [ulong] type accelerators to help out people coming from C# or other languages. Mapped to int16, uint16, uint32, and uint64, respectively.
Maintains parity with existing l / L suffix, where it will also perform rounding functions when applied to a real literal while also ensuring it still returns an appropriate uint or ulong value: 105.5u => 106
Defines a new flags enum NumberSuffixFlags that takes care of one-or-more numerical type suffixes.
Can explicitly parse ul-suffixed numeric values as ulong and us-suffixed values as ushort.
Explicitly specified length types for short and long as well as declared unsigned types will error out and register as an invalid numeric literal if their values after multiplication exceeds the type contraints. This is intentional; if the user specifies a value that cannot be satisfied e.g., 500sgb (500 short, gigabyte multiplier) then the parser should not be expected to figure out what they meant.
Converts strNum parameter of TryGetNumberValue to ReadOnlySpan<char> to hopefully get a little better performance as it flies through all the TryParses.

PR Checklist

PR has a meaningful title
- Use the present tense and imperative mood when describing your changes
Summarized changes
Make sure all .h, .cpp, .cs, .ps1 and .psm1 files have the correct copyright header
This PR is ready to merge and is not Work in Progress.
- If the PR is work in progress, please add the prefix WIP: to the beginning of the title and remove the prefix when the PR is ready.
User-facing changes
- User-facing Documentation needed
  - Issue filed - Issue link:
Testing - New and feature
- Make sure you've added a new test if existing tests do not effectively test the code changed
  - Add [feature] if the change is significant or affects feature tests

(not sure on all of those points; never worked on this repo before; feel free to let me know about stuff I'm missing)

Catch up to PowerShell/PowerShell/master

(for uint32 literals)

msftclas · 2018-08-13T04:10:02Z

All CLA requirements met.

Ignoring things that aren't as a result of these changes

iSazonov · 2018-08-13T09:47:26Z

@vexx32 Please add tests.

iSazonov · 2018-08-13T09:48:26Z

@SteveL-MSFT Should we get PowerShellCommittee approvement?

vexx32 · 2018-08-13T10:40:41Z

I'll get on that this evening. Just realised there's one more case it should handle; native ulong input when there's no multiplier.

You guys think it's worth looking at adding short type accelerators for [uint] and/or [ulong] as well to have parity with regular int accelerators?

vexx32 · 2018-08-13T16:14:25Z

OK Forgive me for not really.. being aware here.

Are we talking xUnit or Pester tests in this instance? The latter I can do easily, the former... well, I'll have to do some reading!

vexx32 · 2018-08-13T20:57:59Z

@iSazonov hopefully these tests are workable. I tried to place them in as sensible a place as I could find; with the other parser notation tests.

I've also modified the accelerator checks to accept the new [uint] and [ulong] type accelerators. I think.

vexx32 · 2018-08-13T22:41:58Z

Okay, I think I properly fixed both the tests and my ulong parsing. 😊

iSazonov · 2018-08-14T03:19:06Z

src/System.Management.Automation/engine/parser/tokenizer.cs

                        return false;
                    }

+                    if (suffix == 'u' || suffix == 'U')


Should this work the same?

(9.9l).GetType().Fullname (9.9u).GetType().Fullname

Suffix 'L' is in other code path - should we process 'U' in the same way?

I'm fairly confused why l and d are in separate code paths at all, honestly.

I would be more inclined to put it with d however as there is a pretty fundamental difference from int to uint which also means the actual bounds of the numerals are completely different to the int ranges.

But... My inclination for that would almost be that it would be better for it to either throw or produce an unsigned double, if there is such a thing. (afaik there is not)

Also, upon reviewing the code paths for l suffix... there are two of them. It seems a bit like there's a double-up on code paths there.

I'm not sure how the real boolean is determined, but l suffix has a code path inside the double parsing segment for that as well as one outside and after it.

Been thinking on this, and I think that it would probably be best to maintain parity with the existing L token suffix.

Adding a small branch for that code path in the if (real){} block as well, mimicking the doubling up the long parsing has. (I do wish there was a good way to consolidate them, but it looks like we're either gonna be doubling real checks for every parse or just doubling a couple of parse checks).

And done!

Had a slip up where I forgot my u token checks were before the check for real which caused a slight issue for a moment.

Also added a test or two to check the round-on-parse behaviour.

iSazonov · 2018-08-14T03:23:11Z

src/System.Management.Automation/engine/parser/tokenizer.cs

+                    {
+                        if (UInt32.TryParse(strNum, style, NumberFormatInfo.InvariantInfo, out uint u))
+                        {
+                            ulong testresult = u * (ulong)multiplier;


Seems the cast is unneeded.

Without it, it appears to convert to int32 or int64. From what I can tell, uint * int => int

Edit: Right, of course, that was before I put in the intermediary variable, yeah the cast shouldn't be needed.

So take 2:
No, the cast is necessary. C# does not permit implicit casts from long to ulong. So either I have to cast after the multiplication as a whole, or before the multiplication. This way is less parentheses.

iSazonov · 2018-08-14T03:30:54Z

test/powershell/Language/Parser/TypeAccelerator.Tests.ps1

            if ( !$IsWindows )
            {
-                $totalAccelerators = 94
+                $totalAccelerators = 97


We add three accelerators but two tests.

vexx32 · 2018-08-14T10:37:09Z

Appreciate the comments so far!

Okay, so updated the tests to match everything properly, and put in [ushort] for uint16 as well. I'm not sure of the prevalence of (U)int16 usage in PS, but I don't think making them a little more accessible to those coming from other languages is at all a bad idea. 😁

Multiplier is a signed value (for no apparent reason) As a result it has to be cast to unsigned. C# does not implicitly cast signed to unsigned values.

Protip: don't try to edit files from your phone ;)

vexx32 · 2018-08-18T20:23:39Z

Currently (according to the sole remaining failing test) we're expecting hex literals to automatically read s UInt32 if they're over Int32 size range.... or something.

I think I can mimic that behaviour, but the question is -- should we? And if we do, then I can do a similar thing for UInt64 hex literals as well.

It's already in memory there as bigint anyway, may as well use it

vexx32 · 2018-08-19T00:01:16Z

OK So essentially if you have a hex value that is a multiple of 8 digits it expects a sign bit and adheres to it. Otherwise it assumes the value is positive when parsing (if you have a negative sign before the hex literal it is still respected and flips the final value).

It doesn't really make... sense... (in my opinion) for the value to go into doubles territory -- that's kind of against the point of using hex (unit precision) so since we already parse as bigint in this region (and always have, I've just kind of rolled it all together here) I figured if you're asking for a hex literal that big you might actually need it for some insane scenario and thought -- why not just yield the bigint value already there for the taking.

I'd love to hear your thoughts on it, though, as always! This is filtered through my own brain, so there may well be stuff in here that we should avoid doing for reasons I haven't thought of yet (yes bigint can cause OOM errors but the parser is already using it to check bounds anyway and always has; this won't introduce any new overflow possibilities that weren't already there just by virtue of entering stupidly huge decimals).

@iSazonov @rjmholt @lzybkr

And regarding that failing test on Get-Random:

PS C:\Users\Joel\source\repos\PSCore\PowerShell> get-random 0xfffffffffffffffffffffffffffffffff
5.04272284445114E+39

^^ that is why. 😄

BrucePay · 2018-08-19T04:51:14Z

@SteveL-MSFT I don't care about the back porting issue but I worry a great deal that so many changes to PowerShell's numeric processing will break backwards compatibility in a significant way. I think we need to revisit the impact of this change at next week's committee meeting.

vexx32 · 2018-08-19T05:21:41Z

Fun fun~ Keep me posted 😄

FWIW, I haven't really changed all that much from the outside looking in; it parses standard things just the same. The only significantly different behaviour (other than the new type suffixes) is that it doesn't just give up on large hex literals anymore. :)

And... though I need to draw up proper benchmarks... a few brief tests seem to indicate it's a bit quicker than before, which is usually a good thing. 👍

In keeping with existing parse logic from the IntXX parsers (sort of; Int parsers are a bit funky with this stuff) This behaviour is a bit more consistent.

iSazonov · 2018-08-20T10:04:14Z

As for sign:

Currently PowerShell implement

-0x1A

-26

Currently PowerShell CAN consider hex numerics as signed

>0xFFFFFFFF

-1

>(0xFFFFFFFF).Gettype()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     Int32                                    System.ValueType

Current PowerShell behavior is to consider high bit as sign because target type always is signed - int32/64/long/double/BigInteger.
If target type is unsigned (ushort,uint,ulong and so on) PowerShell MUST to consider high bit as value not sign
The behavior is in CoreFX https://source.dot.net/#System.Private.CoreLib/shared/System/Number.Parsing.cs,890
vs
https://source.dot.net/#System.Runtime.Numerics/System/Numerics/BigNumber.cs,427
Because we want add binary literals we should consider them here too. PowerShell must process binary literals in the same manner as hex literals.

iSazonov · 2018-08-20T10:13:57Z

@vexx32 The PR is very large (many comments and commits). I suggest split the PR to simplify.

Make new PR with new prefixes and tests - it already was approved by PowerShell Committee and we can fast review and merge.
Make new PR with NumberSuffixFlags in ScanNumberHelper (using switch) + replace Convert.ChangeType with new convert helper methods - the temporary PR without any fuctional changes can be fast reviewed and merged.

After that we can close current PR and open new one with cleaned commits based on PowerShell Committee conclusion - commits with new suffixes and perhaps changes related hex/binary sign.

vexx32 · 2018-08-20T13:22:59Z

RE: the hex behaviour -- I have most of that working, though I can see I need to add a case for Unsigned types to not treat the largest bit as the signing bit. NBD there.

As for the PR suggestions, yeah, I'll get on that. Shouldn't be too hard to pull off. 😄

EDIT: as per #7575 I have split off the type suffix and accelerator changes.

However, I'm not able to do the two-char type suffix without either converting suffix to string or the enum, so I opted to retain the changes that work with the enum specifically. Everything else remains the same.

Just for reference really. [skipci]

vexx32 · 2018-08-21T04:25:34Z

(renamed for clarity, I went a bit beyond the original intent here) ;)

Jaykul · 2018-08-23T06:31:17Z

This seems like a REALLY good candidate for that feature flag testing that's been discussed before.

Not on the sense that is simple to toggle, but in the sense that this big of a change ought to be tested ... a lot ... In such a way that people can initially opt in to the new parser, then eventually make it opt-out ... And if all goes well, remove the old code?

vexx32 · 2018-09-05T15:54:55Z

FWIW; After submitting #7575 I've been toying with possible improvements and have a much cleaner and more polished version of this code at https://github.com/vexx32/PowerShell/tree/Tokenizer/Refactor

It's not yet a PR as it is a branch off of the code in #7575 and includes those changes already.

SteveL-MSFT · 2018-09-05T22:38:32Z

@PowerShell/powershell-committee re-reviewed this and is fine with the proposal (this is not a PR review approval)

lzybkr · 2018-09-05T23:44:03Z

src/System.Management.Automation/engine/parser/tokenizer.cs

-                else if (c == 'g' || c == 'G') { multiplier = 1024 * 1024 * 1024; }
-                else if (c == 't' || c == 'T') { multiplier = 1024L * 1024 * 1024 * 1024; }
-                else if (c == 'p' || c == 'P') { multiplier = 1024L * 1024 * 1024 * 1024 * 1024; }
+                switch (c)


Going from 5 lines to 22 lines of code is an improvement? 🙂
Theoretically a switch might be faster if the jit generates a jump table, but if performance were the concern, I'd use a statically initialized array and replace these 5 lines of code with:

multiplier = multipliers[c];

Ooh, that's a good idea! I'll borrow that and submit it in the next PR. Feel free to review this further if you want, but be aware that 90% of the code here has been completely overhauled in the upcoming PR (and it has fancy tidbits with binary parsing and underscore support yay!) for y'all to have fun combing through as I re-learn my C# ;)

Then maybe just close this PR? I'd rather not waste my time reviewing 90% throwaway code.

Roger that! :)

lzybkr · 2018-09-06T00:03:00Z

src/System.Management.Automation/engine/parser/tokenizer.cs

                            return true;
                        }

+                        // Has double suffix but cannot be parsed as double.


Has decimal suffix so should not be promoted to double, an error should be reported instead.

Haha! That comment is incorrect. It wouldn't get promoted; that would fall through the the result = null; return false; statements.

I'll... quietly make sure I've caught that in my more recent work.... XD

vexx32 · 2018-09-06T00:34:07Z

Closing this for now so we don't waste anymore time here. The base of the stuff that was meant to happen here is in #7575, and the rest needs a great deal of clean up that will be in an upcoming PR.

vexx32 added 5 commits August 12, 2018 17:44

Merge pull request #1 from PowerShell/master

f3b0413

Catch up to PowerShell/PowerShell/master

Add tokenizer logic for recognising numeric tokens with 'u' suffix

c0c1248

(for uint32 literals)

inline very temporary variables

37802f0

Add typesuffix trait to uppercase and lowercase 'u'

dc9dc4c

Add case to get ulong vs uint depending on total value of number

88377e4

vexx32 requested review from BrucePay and daxian-dbw as code owners August 13, 2018 04:09

Fix some small things codefactor complained about

567809f

Ignoring things that aren't as a result of these changes

Add type mappings for uint and ulong accelerators

34812ba

vexx32 added 2 commits August 13, 2018 16:42

Add cases for new type accelerators

47287fb

Add parser tests for uint literals

da25c79

SteveL-MSFT added the Review - Committee The PR/Issue needs a review from the PowerShell Committee label Aug 13, 2018

vexx32 added 3 commits August 13, 2018 18:30

Fix typos in tests

8491b10

fix ulong parsing

cc90f72

Fix silly error

ee9c641

iSazonov reviewed Aug 14, 2018

View reviewed changes

vexx32 added 3 commits August 14, 2018 06:24

Remove unneeded cast

0d66604

Adding ushort accelerator as well (maybe unnecessary?)

3c56a94

Update accelerator tests

5146db0

vexx32 added 2 commits August 14, 2018 07:48

Fix type conversion

27ed4f0

Multiplier is a signed value (for no apparent reason) As a result it has to be cast to unsigned. C# does not implicitly cast signed to unsigned values.

Remove stray characters breaking hash literal

ace51dc

Protip: don't try to edit files from your phone ;)

vexx32 changed the title ~~Adds 'u' suffix for UInt32 or UInt64 literal support~~ Adds u suffix for UInt32 / UInt64 numeric literal support Aug 14, 2018

Remove busted logic for int hex sign bits that wasn't doing anything

d3618c7

Let huge hex values parse as bigint

f1411d5

It's already in memory there as bigint anyway, may as well use it

Commented out the test failing because the huge number can now be parsed

4a21a1e

BrucePay added Review - Committee The PR/Issue needs a review from the PowerShell Committee and removed Committee-Reviewed PS-Committee has reviewed this and made a decision labels Aug 19, 2018

Allow hex numbers to specify long values without suffix

5d2a39b

In keeping with existing parse logic from the IntXX parsers (sort of; Int parsers are a bit funky with this stuff) This behaviour is a bit more consistent.

Add case for dealing with unsigned ints

25913b4

Just for reference really. [skipci]

vexx32 mentioned this pull request Aug 20, 2018

Add C#-style type accelerators and suffixes for ushort, uint, ulong, and short literals. #7575

Closed

10 tasks

vexx32 changed the title ~~Add C#-style type accelerators and literal suffixes for short, ushort, uint, and ulong values~~ Tokenizer Refactor Aug 21, 2018

rjmholt changed the title ~~Tokenizer Refactor~~ Add unsigned and short numeric literal and type accelerator support Aug 21, 2018

rjmholt changed the title ~~Add unsigned and short numeric literal and type accelerator support~~ Refactor the tokenizer Aug 21, 2018

SteveL-MSFT added Committee-Reviewed PS-Committee has reviewed this and made a decision and removed Review - Committee The PR/Issue needs a review from the PowerShell Committee labels Sep 5, 2018

lzybkr reviewed Sep 5, 2018

View reviewed changes

lzybkr reviewed Sep 6, 2018

View reviewed changes

vexx32 closed this Sep 6, 2018

vexx32 mentioned this pull request Sep 18, 2018

Add C#-style type accelerators and suffixes for ushort, uint, ulong, and short literals #7813

Merged

10 tasks

vexx32 deleted the Tokenizer/UnsignedInt branch September 19, 2018 23:00

Refactor the tokenizer #7509

Refactor the tokenizer #7509

Uh oh!

Conversation

vexx32 commented Aug 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

PR Checklist

Uh oh!

msftclas commented Aug 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iSazonov commented Aug 13, 2018

Uh oh!

iSazonov commented Aug 13, 2018

Uh oh!

vexx32 commented Aug 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vexx32 commented Aug 13, 2018

Uh oh!

vexx32 commented Aug 13, 2018

Uh oh!

vexx32 commented Aug 13, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 Aug 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 Aug 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 Aug 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 Aug 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 commented Aug 14, 2018

Uh oh!

vexx32 commented Aug 18, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vexx32 commented Aug 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BrucePay commented Aug 19, 2018

Uh oh!

vexx32 commented Aug 19, 2018

Uh oh!

iSazonov commented Aug 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iSazonov commented Aug 20, 2018

Uh oh!

vexx32 commented Aug 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vexx32 commented Aug 21, 2018

Uh oh!

Jaykul commented Aug 23, 2018

Uh oh!

vexx32 commented Sep 5, 2018

Uh oh!

SteveL-MSFT commented Sep 5, 2018

vexx32 commented Aug 13, 2018 •

edited

Loading

msftclas commented Aug 13, 2018 •

edited

Loading

vexx32 commented Aug 13, 2018 •

edited

Loading

vexx32 Aug 14, 2018 •

edited

Loading

vexx32 Aug 14, 2018 •

edited

Loading

vexx32 Aug 14, 2018 •

edited

Loading

vexx32 Aug 14, 2018 •

edited

Loading

vexx32 commented Aug 18, 2018 •

edited

Loading

vexx32 commented Aug 19, 2018 •

edited

Loading

iSazonov commented Aug 20, 2018 •

edited

Loading

vexx32 commented Aug 20, 2018 •

edited

Loading

lzybkr Sep 6, 2018 •

edited

Loading