-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Refactor the tokenizer #7509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the tokenizer #7509
Conversation
Catch up to PowerShell/PowerShell/master
(for uint32 literals)
Ignoring things that aren't as a result of these changes
|
@vexx32 Please add tests. |
|
@SteveL-MSFT Should we get PowerShellCommittee approvement? |
|
I'll get on that this evening. Just realised there's one more case it should handle; native ulong input when there's no multiplier. You guys think it's worth looking at adding short type accelerators for |
|
OK Forgive me for not really.. being aware here. Are we talking xUnit or Pester tests in this instance? The latter I can do easily, the former... well, I'll have to do some reading! |
|
@iSazonov hopefully these tests are workable. I tried to place them in as sensible a place as I could find; with the other parser notation tests. I've also modified the accelerator checks to accept the new |
|
Okay, I think I properly fixed both the tests and my ulong parsing. 😊 |
| return false; | ||
| } | ||
|
|
||
| if (suffix == 'u' || suffix == 'U') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this work the same?
(9.9l).GetType().Fullname
(9.9u).GetType().FullnameSuffix 'L' is in other code path - should we process 'U' in the same way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly confused why l and d are in separate code paths at all, honestly.
I would be more inclined to put it with d however as there is a pretty fundamental difference from int to uint which also means the actual bounds of the numerals are completely different to the int ranges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But... My inclination for that would almost be that it would be better for it to either throw or produce an unsigned double, if there is such a thing. (afaik there is not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, upon reviewing the code paths for l suffix... there are two of them. It seems a bit like there's a double-up on code paths there.
I'm not sure how the real boolean is determined, but l suffix has a code path inside the double parsing segment for that as well as one outside and after it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Been thinking on this, and I think that it would probably be best to maintain parity with the existing L token suffix.
Adding a small branch for that code path in the if (real){} block as well, mimicking the doubling up the long parsing has. (I do wish there was a good way to consolidate them, but it looks like we're either gonna be doubling real checks for every parse or just doubling a couple of parse checks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And done!
Had a slip up where I forgot my u token checks were before the check for real which caused a slight issue for a moment.
Also added a test or two to check the round-on-parse behaviour.
| { | ||
| if (UInt32.TryParse(strNum, style, NumberFormatInfo.InvariantInfo, out uint u)) | ||
| { | ||
| ulong testresult = u * (ulong)multiplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the cast is unneeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without it, it appears to convert to int32 or int64. From what I can tell, uint * int => int
Edit: Right, of course, that was before I put in the intermediary variable, yeah the cast shouldn't be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So take 2:
No, the cast is necessary. C# does not permit implicit casts from long to ulong. So either I have to cast after the multiplication as a whole, or before the multiplication. This way is less parentheses.
| if ( !$IsWindows ) | ||
| { | ||
| $totalAccelerators = 94 | ||
| $totalAccelerators = 97 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We add three accelerators but two tests.
|
Appreciate the comments so far! Okay, so updated the tests to match everything properly, and put in |
Multiplier is a signed value (for no apparent reason) As a result it has to be cast to unsigned. C# does not implicitly cast signed to unsigned values.
Protip: don't try to edit files from your phone ;)
u suffix for UInt32 / UInt64 numeric literal support
|
Currently (according to the sole remaining failing test) we're expecting hex literals to automatically read s I think I can mimic that behaviour, but the question is -- should we? And if we do, then I can do a similar thing for |
It's already in memory there as bigint anyway, may as well use it
|
OK So essentially if you have a hex value that is a multiple of 8 digits it expects a sign bit and adheres to it. Otherwise it assumes the value is positive when parsing (if you have a negative sign before the hex literal it is still respected and flips the final value). It doesn't really make... sense... (in my opinion) for the value to go into doubles territory -- that's kind of against the point of using hex (unit precision) so since we already parse as I'd love to hear your thoughts on it, though, as always! This is filtered through my own brain, so there may well be stuff in here that we should avoid doing for reasons I haven't thought of yet (yes And regarding that failing test on PS C:\Users\Joel\source\repos\PSCore\PowerShell> get-random 0xfffffffffffffffffffffffffffffffff
5.04272284445114E+39^^ that is why. 😄 |
|
@SteveL-MSFT I don't care about the back porting issue but I worry a great deal that so many changes to PowerShell's numeric processing will break backwards compatibility in a significant way. I think we need to revisit the impact of this change at next week's committee meeting. |
|
Fun fun~ Keep me posted 😄 FWIW, I haven't really changed all that much from the outside looking in; it parses standard things just the same. The only significantly different behaviour (other than the new type suffixes) is that it doesn't just give up on large hex literals anymore. :) And... though I need to draw up proper benchmarks... a few brief tests seem to indicate it's a bit quicker than before, which is usually a good thing. 👍 |
In keeping with existing parse logic from the IntXX parsers (sort of; Int parsers are a bit funky with this stuff) This behaviour is a bit more consistent.
|
As for sign:
-0x1A
-26
>0xFFFFFFFF
-1
>(0xFFFFFFFF).Gettype()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Int32 System.ValueType
|
|
@vexx32 The PR is very large (many comments and commits). I suggest split the PR to simplify.
After that we can close current PR and open new one with cleaned commits based on PowerShell Committee conclusion - commits with new suffixes and perhaps changes related hex/binary sign. |
|
RE: the hex behaviour -- I have most of that working, though I can see I need to add a case for As for the PR suggestions, yeah, I'll get on that. Shouldn't be too hard to pull off. 😄 EDIT: as per #7575 I have split off the type suffix and accelerator changes. However, I'm not able to do the two-char type suffix without either converting |
Just for reference really. [skipci]
|
(renamed for clarity, I went a bit beyond the original intent here) ;) |
|
This seems like a REALLY good candidate for that feature flag testing that's been discussed before. Not on the sense that is simple to toggle, but in the sense that this big of a change ought to be tested ... a lot ... In such a way that people can initially opt in to the new parser, then eventually make it opt-out ... And if all goes well, remove the old code? |
|
FWIW; After submitting #7575 I've been toying with possible improvements and have a much cleaner and more polished version of this code at https://github.com/vexx32/PowerShell/tree/Tokenizer/Refactor It's not yet a PR as it is a branch off of the code in #7575 and includes those changes already. |
|
@PowerShell/powershell-committee re-reviewed this and is fine with the proposal (this is not a PR review approval) |
| else if (c == 'g' || c == 'G') { multiplier = 1024 * 1024 * 1024; } | ||
| else if (c == 't' || c == 'T') { multiplier = 1024L * 1024 * 1024 * 1024; } | ||
| else if (c == 'p' || c == 'P') { multiplier = 1024L * 1024 * 1024 * 1024 * 1024; } | ||
| switch (c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going from 5 lines to 22 lines of code is an improvement? 🙂
Theoretically a switch might be faster if the jit generates a jump table, but if performance were the concern, I'd use a statically initialized array and replace these 5 lines of code with:
multiplier = multipliers[c];
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, that's a good idea! I'll borrow that and submit it in the next PR. Feel free to review this further if you want, but be aware that 90% of the code here has been completely overhauled in the upcoming PR (and it has fancy tidbits with binary parsing and underscore support yay!) for y'all to have fun combing through as I re-learn my C# ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe just close this PR? I'd rather not waste my time reviewing 90% throwaway code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger that! :)
| return true; | ||
| } | ||
|
|
||
| // Has double suffix but cannot be parsed as double. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has decimal suffix so should not be promoted to double, an error should be reported instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha! That comment is incorrect. It wouldn't get promoted; that would fall through the the result = null; return false; statements.
I'll... quietly make sure I've caught that in my more recent work.... XD
|
Closing this for now so we don't waste anymore time here. The base of the stuff that was meant to happen here is in #7575, and the rest needs a great deal of clean up that will be in an upcoming PR. |


PR Summary
Implements 'u' and 's' token case in the Tokenizer.cs and CharTraits.cs parsing which allows the use of uint or short literals as numbers with a 'u' and/or 's' suffix:
I also tidied up some of the code in the vicinity to inline some of the very temporary variable definitions.
Comprehensive Summary:
uands-uintandshort, respectively.UInt32orUInt64is the appropriate return type by comparing the multiplied value againstUIUnt32.MaxValuewhen neitherlnorsis supplied.[short],[ushort],[uint], and[ulong]type accelerators to help out people coming from C# or other languages. Mapped toint16,uint16,uint32, anduint64, respectively.l/Lsuffix, where it will also perform rounding functions when applied to arealliteral while also ensuring it still returns an appropriateuintorulongvalue:105.5u => 106NumberSuffixFlagsthat takes care of one-or-more numerical type suffixes.ul-suffixed numeric values asulongandus-suffixed values asushort.shortandlongas well as declaredunsignedtypes will error out and register as an invalid numeric literal if their values after multiplication exceeds the type contraints. This is intentional; if the user specifies a value that cannot be satisfied e.g.,500sgb(500 short, gigabyte multiplier) then the parser should not be expected to figure out what they meant.strNumparameter ofTryGetNumberValuetoReadOnlySpan<char>to hopefully get a little better performance as it flies through all theTryParses.PR Checklist
.h,.cpp,.cs,.ps1and.psm1files have the correct copyright headerWIP:to the beginning of the title and remove the prefix when the PR is ready.[feature]if the change is significant or affects feature tests(not sure on all of those points; never worked on this repo before; feel free to let me know about stuff I'm missing)