-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
These are the benchmarks @HumanEquivalentUnit and I performed whilst working on this binary parser. Top notch work from him, frankly astonishing. The two intermediary methods were some of the better solutions of the 10 or 20 iterations of this code that we examined. https://gist.github.com/vexx32/852cd1d0ddc3e1a93f77b90ab23cfe3d Note that all cases were tested within ranges that Convert.ToIntX() methods are available, but the current method (labelled |
|
@vexx32 I'd prefer to see small PR - with |
|
I understand your viewpoint, but in order to permit I think the alternative would leave you guys with a lot more to review, only to possibly pull it out later. Also, it would make any further PRs significantly more complex. I felt it best to simplify the code paths as much as possible, in order to prevent the tokenizer becoming a complex morass of unmaintainable code. The only parts I could really strip out for you guys to simplify them is the underscore literal support, and that is... not a lot. :( |
|
Underscore and I suffix is not approved in related issue. So we have to wait. |
3946b0d to
5f36d13
Compare
|
Rebased to pick up CI fixes. 😄 |
iSazonov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll continue after my computer rebooted :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is clear logic for sign but I think we need more feedback whether this is a good UX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is consistent with the existing patterns for things like hexadecimal, but I would be inclined to agree that there could be improvements made here, should we want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
return c & 0x30 || c & 0x31;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason we can't do this, but I don't see anything else in the file that behaves this way and felt it best to continue the prevalent pattern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+30% performance :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the experiment below my suggestion should be faster :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you meant c == 0x30 || c == 0x31? & does not return a straight true/false or 0 / 1
(we can only do this in the binary parse because we also & 0b1111 or strip the high bits by cast to (byte))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return (s_traits[c] & CharTraits.BinaryDigit) != 0; | |
| internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov honestly a lot of these could also be reformatted like this regardless. For example:
internal static bool IsDecimalDigit(this char c)
{
if (c < 128)
{
return (s_traits[c] & CharTraits.Digit) != 0;
}
return false;
}becomes
internal static bool IsDecimalDigit(this char c) => c < 128 && (s_traits[c] & CharTraits.Digit) != 0;Is this a worthwhile change to make to the common pattern of these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can do this.
Pattern from CoreFX repo:
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;
internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! :)
test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use auto-decrement here? I'd prefer do "-1" in previous init line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just felt clever to use autodecrement to change the variable from "size of array" to "last index of array".
It is not very clear, and could be outputByteWalker -= 1 in the following line, if that's more readable. Or like this?
Span<byte> outputBytes = new byte[(digits.Length + 7) / 8];
int outputByteWalker = outputBytes.Length - 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use more informational name like outputByteWalkerLastIndex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use outputByteWalker-- here - very bad readability is way to bugs.
I'd add new local variable - one for size/length, second for last index if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take care of this momentarily. 😄
a5e01c5 to
053262c
Compare
1d83a13 to
7d4ff02
Compare
| case 16: // short | ||
| case 32: // int | ||
| case 64: // long | ||
| case 96: // decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder - is this constants the same for all hardware platforms? Can we catch a problem on some ARMs? Or more general question - do we sure that PowerShell signed constants (binary and hex) is portable?
@mklement0 have you any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vexx32 It is not low level problem because we say about digits - it can be PowerShell language propblem if ARM users expect sign bit on another position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, that could definitely be a concern. I could always refactor this to incorporate this, but I'd need to know what to incorporate here to account for it. Not sure where to find this information at the moment. 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found nothing useful. I hope @mklement0 help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:
From a hex string: prepend a 0:
// 0xff as 255; without the leading 0, the result would be -1
BigInteger.Parse("0ff",NumberStyles.AllowHexSpecifier)From a byte array - prepend a 0x byte (with Big-Endian ordering):
new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.
And with binary, it would simply always mean invoking the BigInteger constructor with a value of true for the isUnsigned parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, on the topic of the breaking change. I know it's desirable to change it, but it seems a bit risky to make such a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured that would probably be the case. It is a fairly straightforward follow up if we find reason to change it later.
I could even put it in as an experimental feature later on, I think.
| | (digits[byteWalker - 6] << 6) | ||
| | (digits[byteWalker - 5] << 5) | ||
| | (digits[byteWalker - 4] << 4)) | ||
| | (((digits[byteWalker - 3] << 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use the extra parentheses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It lets us make (8x zero + 7x OR) operations down to (1x zeroing + 8x OR), and is faster.
The brackets are:
(byte) (
((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )
)
This works because the bit-patterns in the input (chars '0', '1') are 00110000 and 00110001. That is a coincidence of them being ASCII 48, ASCII 49. We can take advantage of those bit patterns, and do less work. We could zero the 0011 on every character, like this more readable slower code, and then shift the least significant bit, and OR them all together:
(byte)(
((digits[byteWalker - 7] - '0') << 7) |
((digits[byteWalker - 6] - '0') << 6) |
((digits[byteWalker - 5] - '0') << 5) |
((digits[byteWalker - 4] - '0') << 4) |
((digits[byteWalker - 3] - '0') << 3) |
((digits[byteWalker - 2] - '0') << 2) |
((digits[byteWalker - 1] - '0') << 1) |
((digits[byteWalker - 0] - '0') )
)
But, for the 7,6,5,4 shifts, they will push the 0011 part far out of the way. When we cast to (byte) they get dropped. Saves 4x zeroing operations if we treat those as a group.
The 3,2,1,0 shifts will combine with each other, but the high 0011 bits would clash if we weren't careful. So we merge these as a group, then do a single & 0b1111 of this group, and save 3x zeroing.
Then +1x OR to merge the two halves. ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )
This is quite hard to explain, I hope this is readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ( (7)|(6)|(5)|(4)|(3)|(2)|(1)|(0) ) & 0b1111?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would not give the right answer, it would make the high bits from inputs 3,2,1,0 overwrite the data from 7,6,5,4 (then & 0b1111 would leave only 0-15 output values).
String input: "10101011"
index from right: 76543210
Bit patterns of shifted input characters,
we want just the single far right bit from each line,
all merged down:
0011000|7
001100|06
00110|005
0011|0004
|-------possible overlap when combining
001|1000|3
00|1100|02
0|0110|001
|0011|0000
^ removed by &0b1111 to prevent overlap
only in the 3,2,1,0 section
^ far left overlap is OK, dropped by (byte) cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarification!
I think we need reformat the code and add the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate some suggestions on how to approach the reformatting. The actual behaviour of the code is documented in the comments (although perhaps that can also be improved upon).
I suspect it will be necessary to skirt StyleCop rules heavily to make this much more readable than it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We could add the well-formatted sample
( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 )in your comment. - We could add white spaces in code for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
I think this might annoy StyleCop, but... such is the price of readability. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return (s_traits[c] & CharTraits.BinaryDigit) != 0; | |
| internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1'; |
test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1
Outdated
Show resolved
Hide resolved
|
@vexx32 I have no more ideas than to occupy your time :-) - thanks for great work! We are waiting PowerShell Committee. |
|
Appreciate you taking the time to help us make it better, Ilya! 😄 I might rebase once more to tidy the commit history a little if I get a moment, but otherwise I don't mind waiting till the Committee makes a decision here. |
|
Please don't rebase until maintainers ask you. |
|
This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days. |
|
@SteveL-MSFT @TravisEz13 @daxian-dbw Please review the PR. |
|
@TravisEz13 are we just pending Dongbo to get back from his well-deserved vacation and review? Any chance this will make it for 6.2, or should I expect it to wait till after that is released? :slight_smile: |
|
@vexx32 I'd like to get someone to review this that has some experience in the area. Or at least one more person from the PowerShell team. |
rjmholt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Left a minor comment, but not a blocker
| return false; | ||
| } | ||
| // Return true if the character is a binary digit. | ||
| internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is true, it seems more straightforward to just have => c == '0' || c == '1';. It took me a minute or so to realise why this isn't a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Perhaps the coercion + comparison has a performance advantage?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjmholt I was following @iSazonov's lead on that one, he apparently pulled that from the CoreFX code somewhere or other. Knowing nothing about how these checks would be performed on a hardware level means I can only speculate on potential ways this might be better than the ultimately far more readable version you propose.
Should I change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjmholt The micro optimization come from CoreFX - 2 vs 3 operations - this gives huge perf win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a comment to this effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we get the question the comment can help code readers.
| result = dmValue; | ||
| return true; | ||
| } | ||
| else if (Utils.TryCast(bigValue, out double d)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the else here; following the branch above always returns
|
@TravisEz13 Seems like CI got stuck here, sorry; could you kick it for me when you get some time? Cheers! 😊 |
|
@vexx32 The jobs never started. I cannot restart them without closing the PR and re-opening it. I've seen in a few cases where you have |
…hell into Tokenizer/Refactor
|
Because this is a breaking change, I'm scheduling it to be merged first thing during 6.3. |
|
@TravisEz13 Do you guys need anything more from me before you merge? 🙂 |
|
Perhaps @daxian-dbw could find something :-) |
|
@vexx32 Thanks for the great improvement! |
PR Summary
Fixes #7557
0b11010110nsuffix using BigInteger; details below.NumberFormatenum to specify hex/binary/base 10 for the tokenizer, replacing oldbool hex.nsuffix for all numeric literals to support returning value as aBigIntegerif requested. This bypasses the issue of large literals losing accuracy when they cast throughdouble.Binary / Hex Parsing Implementation
0xFFFFFFFF -eq ([int]-1)and0xFFFFFFFFFFFFFFFF -eq ([long]-1), but suffixingucreatesint.MaxValueandlong.MaxValue, respectively, instead.0b011111111unsigned suffix (or combination suffix that includesu) ignores sign bits, similar to how parsing a hex string using[Convert]::ToUint32()would do so.-prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.Refactored numeric tokenizer parsing
New flow:
real(.01,0.0, or0e0syntaxes)false.Double, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returnsfalse.falseif the value exceeds the specified type's bounds.doublewhere no suffix has been applied.BigIntegerusing some special casing to retain original behaviours in int/long ranges.BigInteger.intlongdecimal(base-10 literals only)double(base-10 literals only)If the value is outsideBigIntegerfor binary or hex literals.longrange (for hex and binary) ordoublerange (for base 10), the parse will fail; higher values must be explicitly requested using then/NBigInteger suffix.This is a breaking change as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.
Notes:
The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.Supporting underscore literals is a further breaking change causing some generic tokens likePer @SteveL-MSFT's comment this proposal was rejected.1_000_000to be read as numerals instead.Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason not to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.PR Checklist
[Change is not breaking](https://github.com/PowerShell/PowerShell/blob/master/.github/CONTRIBUTING.md#making-breaking-changes).h,.cpp,.cs,.ps1and.psm1files have the correct copyright headerWIP:to the beginning of the title and remove the prefix when the PR is ready.[feature]if the change is significant or affects feature tests/cc @iSazonov , @SteveL-MSFT as it pertains to the tagged-for-further-discussion #7557