Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

vexx32 · 2018-10-11T14:06:22Z

PR Summary

Fixes #7557

Adds support for binary parsing in format echoing hex: 0b11010110
- Works with all existing type suffixes and multipliers.
- Supports arbitrary length parsing with n suffix using BigInteger; details below.
Adds NumberFormat enum to specify hex/binary/base 10 for the tokenizer, replacing old bool hex.
Adds n suffix for all numeric literals to support returning value as a BigInteger if requested. This bypasses the issue of large literals losing accuracy when they cast through double.
Adds tests for all new behaviours.

Binary / Hex Parsing Implementation

Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary.
- i.e., 0xFFFFFFFF -eq ([int]-1) and 0xFFFFFFFFFFFFFFFF -eq ([long]-1), but suffixing u creates int.MaxValue and long.MaxValue, respectively, instead.
Sign bits higher than this are accepted for bigint-suffixed numerals:
- Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit
- Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters.
- Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. 0b011111111
Specifying an unsigned suffix (or combination suffix that includes u) ignores sign bits, similar to how parsing a hex string using [Convert]::ToUint32() would do so.
Supports negating literals using - prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.

Refactored numeric tokenizer parsing

New flow:

Check for real (.01, 0.0, or 0e0 syntaxes)
1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns false.
2. TryParse as Double, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns false.
  1. Check type suffixes and attempt to cast into appropriate type. This will return false if the value exceeds the specified type's bounds.
  2. Default to parsing as double where no suffix has been applied.
Check number format.
- If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string.
- If hex, TryParse into BigInteger using some special casing to retain original behaviours in int/long ranges.
- If neither binary nor hex, TryParse normally as a BigInteger.
Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows.
Check type suffixes.
- If a specific type suffix is used, check type bounds and attempt to parse into that type.
  - If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed.
If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number.
- int
- long
- decimal (base-10 literals only)
- double (base-10 literals only)
- ~~BigInteger for binary or hex literals.~~ If the value is outside long range (for hex and binary) or double range (for base 10), the parse will fail; higher values must be explicitly requested using the n/N BigInteger suffix.

This is a breaking change as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.

Notes:

Binary literal support was approved by the committee in Allow binary numbers, byte literals, and BigInteger? #7557
~~The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.~~
- ~~Supporting underscore literals is a further breaking change causing some generic tokens like 1_000_000 to be read as numerals instead.~~ Per @SteveL-MSFT's comment this proposal was rejected.
- Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason not to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.

PR Checklist

/cc @iSazonov , @SteveL-MSFT as it pertains to the tagged-for-further-discussion #7557

vexx32 · 2018-10-11T14:18:21Z

These are the benchmarks @HumanEquivalentUnit and I performed whilst working on this binary parser. Top notch work from him, frankly astonishing. The two intermediary methods were some of the better solutions of the 10 or 20 iterations of this code that we examined.

https://gist.github.com/vexx32/852cd1d0ddc3e1a93f77b90ab23cfe3d

Note that all cases were tested within ranges that Convert.ToIntX() methods are available, but the current method (labelled BigEndianBinaryOr in this table) generalises seamlessly to arbitrary-length binary strings.

iSazonov · 2018-10-11T14:35:28Z

@vexx32 I'd prefer to see small PR - with 0b only. We could more fast review.

vexx32 · 2018-10-11T14:38:14Z

I understand your viewpoint, but in order to permit 0b parsing a majority of these changes are required -- or a significant amount of redundant code would need to be in place, because TryParse() methods are not capable of parsing binary.

I think the alternative would leave you guys with a lot more to review, only to possibly pull it out later. Also, it would make any further PRs significantly more complex. I felt it best to simplify the code paths as much as possible, in order to prevent the tokenizer becoming a complex morass of unmaintainable code.

The only parts I could really strip out for you guys to simplify them is the underscore literal support, and that is... not a lot. :(

iSazonov · 2018-10-11T16:01:16Z

Underscore and I suffix is not approved in related issue. So we have to wait.

vexx32 · 2018-10-12T00:27:23Z

Rebased to pick up CI fixes. 😄

iSazonov

I'll continue after my computer rebooted :-)

src/System.Management.Automation/engine/Utils.cs

src/System.Management.Automation/engine/parser/tokenizer.cs

iSazonov · 2018-10-12T12:50:25Z

src/System.Management.Automation/engine/Utils.cs

It is clear logic for sign but I think we need more feedback whether this is a good UX.

This is consistent with the existing patterns for things like hexadecimal, but I would be inclined to agree that there could be improvements made here, should we want to.

src/System.Management.Automation/engine/Utils.cs

iSazonov · 2018-10-14T17:01:13Z

src/System.Management.Automation/engine/parser/CharTraits.cs

What about:

return c & 0x30 || c & 0x31;

There's no reason we can't do this, but I don't see anything else in the file that behaves this way and felt it best to continue the prevalent pattern here?

+30% performance :-)

Based on the experiment below my suggestion should be faster :-)

Do you meant c == 0x30 || c == 0x31? & does not return a straight true/false or 0 / 1
(we can only do this in the binary parse because we also & 0b1111 or strip the high bits by cast to (byte))

Suggested change

return (s_traits[c] & CharTraits.BinaryDigit) != 0;

internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';

@iSazonov honestly a lot of these could also be reformatted like this regardless. For example:

internal static bool IsDecimalDigit(this char c) { if (c < 128) { return (s_traits[c] & CharTraits.Digit) != 0; } return false; }

becomes

internal static bool IsDecimalDigit(this char c) => c < 128 && (s_traits[c] & CharTraits.Digit) != 0;

Is this a worthwhile change to make to the common pattern of these methods?

Yes, I think we can do this.
Pattern from CoreFX repo:

internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1; internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;

test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1

src/System.Management.Automation/engine/Utils.cs

iSazonov · 2018-10-15T03:37:53Z

src/System.Management.Automation/engine/Utils.cs

Why we use auto-decrement here? I'd prefer do "-1" in previous init line.

It just felt clever to use autodecrement to change the variable from "size of array" to "last index of array".
It is not very clear, and could be outputByteWalker -= 1 in the following line, if that's more readable. Or like this?

Span<byte> outputBytes = new byte[(digits.Length + 7) / 8]; int outputByteWalker = outputBytes.Length - 1;

We could use more informational name like outputByteWalkerLastIndex.

src/System.Management.Automation/engine/Utils.cs

iSazonov · 2018-10-15T15:38:38Z

src/System.Management.Automation/engine/Utils.cs

Please don't use outputByteWalker-- here - very bad readability is way to bugs.
I'd add new local variable - one for size/length, second for last index if needed.

I'll take care of this momentarily. 😄

src/System.Management.Automation/engine/Utils.cs

iSazonov · 2018-10-17T04:45:32Z

src/System.Management.Automation/engine/Utils.cs

+                        case 16: // short
+                        case 32: // int
+                        case 64: // long
+                        case 96: // decimal


I wonder - is this constants the same for all hardware platforms? Can we catch a problem on some ARMs? Or more general question - do we sure that PowerShell signed constants (binary and hex) is portable?

@mklement0 have you any thoughts?

To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.

@vexx32 It is not low level problem because we say about digits - it can be PowerShell language propblem if ARM users expect sign bit on another position.

Aye, that could definitely be a concern. I could always refactor this to incorporate this, but I'd need to know what to incorporate here to account for it. Not sure where to find this information at the moment. 🙁

I found nothing useful. I hope @mklement0 help.

@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:

From a hex string: prepend a 0:

// 0xff as 255; without the leading 0, the result would be -1 BigInteger.Parse("0ff",NumberStyles.AllowHexSpecifier)

From a byte array - prepend a 0x byte (with Big-Endian ordering):

new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)

Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.

And with binary, it would simply always mean invoking the BigInteger constructor with a value of true for the isUnsigned parameter.

Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.

Sorry for the delay, on the topic of the breaking change. I know it's desirable to change it, but it seems a bit risky to make such a change.

I figured that would probably be the case. It is a fairly straightforward follow up if we find reason to change it later.

I could even put it in as an experimental feature later on, I think.

iSazonov · 2018-10-17T04:55:28Z

src/System.Management.Automation/engine/Utils.cs

+                        | (digits[byteWalker - 6] << 6)
+                        | (digits[byteWalker - 5] << 5)
+                        | (digits[byteWalker - 4] << 4))
+                        | (((digits[byteWalker - 3] << 3)


Why we use the extra parentheses?

It lets us make (8x zero + 7x OR) operations down to (1x zeroing + 8x OR), and is faster.

The brackets are:

(byte) ( ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 ) )

This works because the bit-patterns in the input (chars '0', '1') are 00110000 and 00110001. That is a coincidence of them being ASCII 48, ASCII 49. We can take advantage of those bit patterns, and do less work. We could zero the 0011 on every character, like this more readable slower code, and then shift the least significant bit, and OR them all together:

(byte)( ((digits[byteWalker - 7] - '0') << 7) | ((digits[byteWalker - 6] - '0') << 6) | ((digits[byteWalker - 5] - '0') << 5) | ((digits[byteWalker - 4] - '0') << 4) | ((digits[byteWalker - 3] - '0') << 3) | ((digits[byteWalker - 2] - '0') << 2) | ((digits[byteWalker - 1] - '0') << 1) | ((digits[byteWalker - 0] - '0') ) )

But, for the 7,6,5,4 shifts, they will push the 0011 part far out of the way. When we cast to (byte) they get dropped. Saves 4x zeroing operations if we treat those as a group.

The 3,2,1,0 shifts will combine with each other, but the high 0011 bits would clash if we weren't careful. So we merge these as a group, then do a single & 0b1111 of this group, and save 3x zeroing.

Then +1x OR to merge the two halves. ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )

This is quite hard to explain, I hope this is readable.

Why not ( (7)|(6)|(5)|(4)|(3)|(2)|(1)|(0) ) & 0b1111?

That would not give the right answer, it would make the high bits from inputs 3,2,1,0 overwrite the data from 7,6,5,4 (then & 0b1111 would leave only 0-15 output values).

String input: "10101011" index from right: 76543210 Bit patterns of shifted input characters, we want just the single far right bit from each line, all merged down: 0011000|7 001100|06 00110|005 0011|0004 |-------possible overlap when combining 001|1000|3 00|1100|02 0|0110|001 |0011|0000 ^ removed by &0b1111 to prevent overlap only in the 3,2,1,0 section ^ far left overlap is OK, dropped by (byte) cast

Thanks for clarification!

I think we need reformat the code and add the comment.

I would appreciate some suggestions on how to approach the reformatting. The actual behaviour of the code is documented in the comments (although perhaps that can also be improved upon).

I suspect it will be necessary to skirt StyleCop rules heavily to make this much more readable than it is?

We could add the well-formatted sample ( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 ) in your comment.

We could add white spaces in code for better readability.

Sounds good!

I think this might annoy StyleCop, but... such is the price of readability. 😄

iSazonov · 2018-10-17T05:02:30Z

src/System.Management.Automation/engine/parser/CharTraits.cs

Suggested change

return (s_traits[c] & CharTraits.BinaryDigit) != 0;

internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';

test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1

src/System.Management.Automation/engine/Utils.cs

iSazonov · 2018-10-17T13:48:40Z

@vexx32 I have no more ideas than to occupy your time :-) - thanks for great work! We are waiting PowerShell Committee.

vexx32 · 2018-10-17T14:06:04Z

Appreciate you taking the time to help us make it better, Ilya! 😄

I might rebase once more to tidy the commit history a little if I get a moment, but otherwise I don't mind waiting till the Committee makes a decision here.

iSazonov · 2018-10-17T14:08:50Z

Please don't rebase until maintainers ask you.

stale · 2018-11-23T11:18:32Z

This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days.
Thank you for your contributions.
Community members are welcome to grab these works.

iSazonov · 2019-02-01T05:41:37Z

@SteveL-MSFT @TravisEz13 @daxian-dbw Please review the PR.

vexx32 · 2019-02-11T22:25:25Z

@TravisEz13 are we just pending Dongbo to get back from his well-deserved vacation and review?

Any chance this will make it for 6.2, or should I expect it to wait till after that is released? :slight_smile:

TravisEz13 · 2019-02-11T22:52:28Z

@vexx32 I'd like to get someone to review this that has some experience in the area. Or at least one more person from the PowerShell team.

rjmholt

This looks good to me. Left a minor comment, but not a blocker

rjmholt · 2019-02-11T23:38:58Z

src/System.Management.Automation/engine/parser/CharTraits.cs

-            return false;
-        }
+        // Return true if the character is a binary digit.
+        internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;


While this is true, it seems more straightforward to just have => c == '0' || c == '1';. It took me a minute or so to realise why this isn't a problem.

(Perhaps the coercion + comparison has a performance advantage?)

@rjmholt I was following @iSazonov's lead on that one, he apparently pulled that from the CoreFX code somewhere or other. Knowing nothing about how these checks would be performed on a hardware level means I can only speculate on potential ways this might be better than the ultimately far more readable version you propose.

Should I change it?

@rjmholt The micro optimization come from CoreFX - 2 vs 3 operations - this gives huge perf win.

Should I add a comment to this effect?

Because we get the question the comment can help code readers.

rjmholt · 2019-02-11T23:45:36Z

src/System.Management.Automation/engine/parser/tokenizer.cs

+                                    result = dmValue;
+                                    return true;
+                                }
+                                else if (Utils.TryCast(bigValue, out double d))


No need for the else here; following the branch above always returns

vexx32 · 2019-02-12T04:53:25Z

@TravisEz13 Seems like CI got stuck here, sorry; could you kick it for me when you get some time? Cheers! 😊

TravisEz13 · 2019-02-12T17:13:44Z

@vexx32 The jobs never started. I cannot restart them without closing the PR and re-opening it. I've seen in a few cases where you have force-pushed that the PR cannot be reopened. The easiest way is to force push again. For example, make a whitespace change to one of your commit messages and push.

…hell into Tokenizer/Refactor

TravisEz13 · 2019-02-13T04:27:23Z

Because this is a breaking change, I'm scheduling it to be merged first thing during 6.3.

vexx32 · 2019-04-02T18:04:05Z

@TravisEz13 Do you guys need anything more from me before you merge? 🙂

iSazonov · 2019-04-03T05:13:53Z

Perhaps @daxian-dbw could find something :-)

iSazonov · 2019-04-04T03:14:10Z

@vexx32 Thanks for the great improvement!

vexx32 requested review from BrucePay and daxian-dbw as code owners October 11, 2018 14:06

vexx32 mentioned this pull request Oct 11, 2018

Refactor module version/GUID comparison logic #7125

Merged

11 tasks

vexx32 force-pushed the Tokenizer/Refactor branch from 3946b0d to 5f36d13 Compare October 12, 2018 00:24

iSazonov previously requested changes Oct 12, 2018

View reviewed changes

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

iSazonov reviewed Oct 12, 2018

View reviewed changes

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

iSazonov reviewed Oct 14, 2018

View reviewed changes

iSazonov reviewed Oct 15, 2018

View reviewed changes

vexx32 force-pushed the Tokenizer/Refactor branch from a5e01c5 to 053262c Compare October 15, 2018 17:48

iSazonov reviewed Oct 16, 2018

View reviewed changes

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

HumanEquivalentUnit reviewed Oct 16, 2018

View reviewed changes

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

vexx32 force-pushed the Tokenizer/Refactor branch from 1d83a13 to 7d4ff02 Compare October 16, 2018 14:12

iSazonov reviewed Oct 17, 2018

View reviewed changes

iSazonov added Review - Committee The PR/Issue needs a review from the PowerShell Committee Breaking-Change breaking change that may affect users Documentation Needed in this repo Documentation is needed in this repo labels Oct 17, 2018

iSazonov reviewed Oct 17, 2018

View reviewed changes

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

mklement0 mentioned this pull request Oct 22, 2018

Allow binary numbers, byte literals, and BigInteger? #7557

Closed

stale bot added the Stale label Nov 23, 2018

iSazonov requested a review from SteveL-MSFT February 1, 2019 05:40

TravisEz13 approved these changes Feb 2, 2019

View reviewed changes

rjmholt approved these changes Feb 11, 2019

View reviewed changes

vexx32 added 2 commits February 11, 2019 20:23

♻️ Address Rob's feedback

8822139

📝 Add comment to clarify why less readable method is used.

ec2e105

vexx32 added 2 commits February 12, 2019 12:33

Empty commit to restart CI

07649fb

Merge branch 'Tokenizer/Refactor' of https://github.com/vexx32/PowerS…

c2c5e64

…hell into Tokenizer/Refactor

TravisEz13 added this to the 6.3.0-preview.1 milestone Feb 13, 2019

Merge branch 'master' into Tokenizer/Refactor

4f4cf22

vexx32 mentioned this pull request Mar 21, 2019

Implement Get-Random -Count without specifying an InputObject list #9111

Merged

11 tasks

iSazonov added the CL-Engine Indicates that a PR should be marked as an engine change in the Change Log label Apr 3, 2019

TravisEz13 merged commit 3f52adb into PowerShell:master Apr 3, 2019

vexx32 deleted the Tokenizer/Refactor branch April 4, 2019 20:53

sdwheeler mentioned this pull request Apr 9, 2019

Add documentation for binary literals MicrosoftDocs/PowerShell-Docs#4143

Closed

8 tasks

vexx32 mentioned this pull request Feb 11, 2020

Negative hex literal int16 throws parse error #11823

Closed

vexx32 mentioned this pull request Jul 20, 2020

Number literals: parse integer values too large to fit into Decimal as BigInteger (bigint), not Double #13212

Closed

vexx32 mentioned this pull request Mar 8, 2021

why i hate the dot #14963

Closed

vexx32 mentioned this pull request Jun 6, 2022

Binary AND returns incorrect value #4031

Closed

mklement0 mentioned this pull request Oct 27, 2023

Digit Separators #20589

Closed

	return (s_traits[c] & CharTraits.BinaryDigit) != 0;
	internal static bool IsBinaryDigit(this char c) => c == '0' \|\| c == '1';

Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

Uh oh!

Conversation

vexx32 commented Oct 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Binary / Hex Parsing Implementation

Refactored numeric tokenizer parsing

PR Checklist

Uh oh!

vexx32 commented Oct 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iSazonov commented Oct 11, 2018

Uh oh!

vexx32 commented Oct 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iSazonov commented Oct 11, 2018

Uh oh!

vexx32 commented Oct 12, 2018

Uh oh!

iSazonov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iSazonov Oct 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 Oct 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

iSazonov Oct 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vexx32 commented Oct 11, 2018 •

edited

Loading

vexx32 commented Oct 11, 2018 •

edited

Loading

vexx32 commented Oct 11, 2018 •

edited

Loading

iSazonov Oct 15, 2018 •

edited

Loading

vexx32 Oct 15, 2018 •

edited

Loading

iSazonov Oct 17, 2018 •

edited

Loading

vexx32 Oct 17, 2018 •

edited

Loading

iSazonov Oct 17, 2018 •

edited

Loading

HumanEquivalentUnit Oct 17, 2018 •

edited

Loading