sclang: Unicode mathematical operators #7425
Conversation
This PR creates a new lexer, replacing the old one with an easier-to-read and more reusable alternative. Fuzz testing against the old lexer is also provided and has been run for a while, until no new issues were found.
|
That would be lovely. This means that symbols and strings also support unicode? |
|
They already do! Symbols only when declared like 'abcd', not \abcd. Strings work by gobbling everything until ' " ' is encountered, same with quoted symbols. I could add this to the slash symbol, but the logic there is already weird and complex so I'm reluctant to do so. |
|
Are you sure? If you call Do you get true? |
|
No, our string class isn't a text container, it's a byte container. Changing this is a huge breaking change. What we could do is introduce a new string literal that turns into a text object, or make the old string syntax turn into a byte array. The former means people have to use this new string class everywhere, the latter means everyone needs to update their code. The latter is the 'correct' solution done by other languages like Python. That is a separate PR though and not directly related to this, which is strictly about adding unicode operators to the language. It would also involve including some form of unicode library and we'd have to deal with complex things like graphemes, multi-codepoint sequences, and normalisation forms. This PR here is far smaller, only adding a little piece of unicode into the language. I meant you can store and concat unicode in symbols and strings, which is sufficient to do most things you'd want to do with selectors. |
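To make the byte-container point concrete, here is a minimal sketch in Python (not sclang; Python's bytes/str split only stands in for the two kinds of string discussed above, and none of the names come from SuperCollider). It suggests why storing and concatenating UTF-8 in a byte string works, while counting or slicing "characters" does not.

```python
# Illustration only: Python's bytes vs str stands in for
# "byte container" vs "text container".
text = "a∑b"                      # three codepoints
raw = text.encode("utf-8")        # what a byte-oriented string would hold

codepoint_count = len(text)       # counts codepoints
byte_count = len(raw)             # counts UTF-8 bytes; ∑ (U+2211) takes three


def decodes_cleanly(data: bytes) -> bool:
    """Return True if data is a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Concatenating two valid UTF-8 byte strings always gives valid UTF-8.
joined = ("∑".encode("utf-8") + "∀".encode("utf-8")).decode("utf-8")

# Slicing by byte index can cut a codepoint in half.
truncated_ok = decodes_cleanly(raw[:2])   # "a" plus the first byte of ∑
```

Here the byte count is larger than the codepoint count, the concatenation round-trips, and the two-byte slice no longer decodes.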
That's impossible, $ is a char, you need an arbitrary amount of space to store a grapheme. Consider this thing, its only one 'character' (grapheme).... ḧ̶̶̴̷̶̵̸̴̷̷̶̸̷̵̷̸̷̷̸̶̷̸̵̶̷̸̨̢̢̡̨̧̡̢̢̢̧̡̢̢̨̧̨̧̢̢̢̨̨̧̢̧̧̛̛̟͚̯̳͍͔̜̞̯̭̙͓͓͍̹̤̱̘͕̮͎̳̰̜͍̗͍̬͎̰̝̟̫̞̱͕̟̺̺̜̟̞̤̝̜̥̼̳̟̬̲͖͓̪̠̖̼̗͈̦̤̳̝̪͔̦̗̠͙̺̰̥̹͎͉̩̺̯̳̟̭̥̠̱̱̬̥̻̲̖̯̼͓̬͕͖̼̮̣̬̠͍̖̬͇̮̭̭̞̳̪̜̞̪͉͔̩̩̺̙̗̼͓̲͙̩̪̩̬̠͔̱͉͕̪̳̲̥̟̺͍̙̠̱̝̗͖̠̜͙̰͙̦̙̼̹̖̮̜̹͍̘̠̱̼̗̺̟̰͚͕̹̪͕̹͕̝͍͎́̋̽͆́̈̅̑̌̌̊̄̍͗͒̀̋͛͐̆͆͑̃́̅̌̆́̂̊̆͛̓̀̄̀̔̉͑́̌͑̂̈́̿̌̂̊̈̈́̇̈́̃̋̉̀̋͗̈̏͂̍͆͑̆̎͐͂̈̽̍̌͌͒̏̓͌̓͒̾͊̓̒̈́̑̔̀̋̑̀̐̽͛̈̀͒͗̽͛̔̈́̉͋̈́͐́͛̉̓̈́͐͗̊̇̀̍͗͆̋̓́̈̌̐͛͊̃̅͊̔̄̿͋̅̈͛̈́̇̌͂̔̉͐͂͆̐̅̾̋͆͑̏̽͌̈͑̈́̋̽̅͆̓͆̽̓̊́̏̈́́̈̆̀̏͛̊̄̀̓̋̂́̊̇̽̓̂̄̽͐̓̽̚͘̕͘̕͘̚͘͘̚͘͘̚͜͜͜͜͜͜͜͠͠͠͝͝͝͠͝͝͠͝͝͝͝͠ͅͅͅͅͅ |
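A small Python sketch of the same point, using a much shorter stack of marks than the example above (the variable names are made up for illustration). One user-perceived character is a base letter plus any number of combining marks, so no fixed-size char can hold it, and normalisation only helps where a precomposed codepoint happens to exist.

```python
import unicodedata

base = "h"
# COMBINING DIAERESIS, COMBINING LONG STROKE OVERLAY, COMBINING OGONEK
marks = "\u0308\u0336\u0328"
grapheme = base + marks            # displayed as a single character

codepoints = len(grapheme)                    # one base + three marks
utf8_bytes = len(grapheme.encode("utf-8"))    # 1 byte + 3 marks * 2 bytes

# All three marks are in a Unicode "Mark" general category (Mn, Mc or Me).
all_marks = all(unicodedata.category(m).startswith("M") for m in marks)

# NFC can fold h + diaeresis into the precomposed U+1E27 ...
simple_nfc = unicodedata.normalize("NFC", "h\u0308")
# ... but the full stack still needs more than one codepoint after NFC.
stacked_nfc = unicodedata.normalize("NFC", grapheme)
```

Adding more marks to `marks` grows `codepoints` without bound, which is the situation the grapheme above is in.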
|
Yes, this is what I thought! For this reason I once had the idea to introduce a class of nested string, which can deal both with this and with something like quotation levels (strings in strings in strings, like arrays). https://github.com/telephon/Strang Just for reference. |
|
I suppose that the lexer refactor commit is a preparatory thing, independent of the new feature, right? |
|
Well it makes all the stuff easier, along with preparing to fix up the lexer/parser communication, which is unsafe and I'm pretty certain wrong because it's a lookahead parser and we mutate global state. The new lexer works in codepoints, so it's trivial to do this kind of thing... And since all unicode use is currently royally broken, there is this nice unicode shaped hole to fill with goodies!!! I'm also planning (in various degrees of doneness):
|
|
Since this is a large change whose record will matter in the future, I'd suggest you split it up into two pull requests. I see no problem with either of them, but looking at 68007ba this seems absolutely simple, once all the rest is in place. |
|
#7394 already did that! |
|
ok thanks! |
Just a draft right now to demonstrate the new lexer.
This PR depends on that one, the actual change here is about 5 lines of code.
Purpose and Motivation
Adds mathematical operators to the supported binary operator list
The sc code here is just an example, I don't intend it seriously, but could stay?
The real conversation here is how to support unicode. The ICU has a huge document on this subject, it's really complicated. I suggest we just selectively add bits as we need them, rather than trying to add everything!
For example, we probably want to forbid things like − (unicode minus sign) because it looks the same as - (normal minus sign).
One objection to this approach is that the codepoints added here are always binary operators, this means you can't do ∑(array), instead you'd have to write (∑)(array) or array ∑ nil or array ∑ (_ + _).
I've added the Mathematical Operators block and Supplemental Mathematical Operators block to the binary operator list, you can find these here: https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode
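As a rough sketch of what "adding the two blocks to the binary operator list" could amount to, here is a Python illustration (not the actual C++ lexer code; the function names and the deny-list are hypothetical). The block ranges are the ones given in the Unicode charts. The deny-list models the suggestion of forbidding lookalikes such as the unicode minus sign, which this PR may or may not do. The run-splitting helper assumes that consecutive operator codepoints lex as one operator.

```python
# Block ranges as listed in the Unicode charts.
MATH_OPERATORS = range(0x2200, 0x2300)                # Mathematical Operators
SUPPLEMENTAL_MATH_OPERATORS = range(0x2A00, 0x2B00)   # Supplemental Mathematical Operators

# Hypothetical deny-list of lookalikes of ASCII operators.
CONFUSABLE = {0x2212}   # MINUS SIGN, easily mistaken for ASCII '-'


def is_unicode_binop(ch: str) -> bool:
    """True if the single codepoint ch would count as a binary operator character."""
    cp = ord(ch)
    in_block = cp in MATH_OPERATORS or cp in SUPPLEMENTAL_MATH_OPERATORS
    return in_block and cp not in CONFUSABLE


def split_operator_runs(source: str) -> list:
    """Collect each maximal run of operator codepoints as one token."""
    tokens = []
    run = ""
    for ch in source:
        if is_unicode_binop(ch):
            run += ch
        elif run:
            tokens.append(run)
            run = ""
    if run:
        tokens.append(run)
    return tokens
```

With these assumptions, `split_operator_runs("a ∋∀ b ∑ c")` yields two tokens, one for the run ∋∀ and one for ∑, and the unicode minus sign is rejected even though it sits inside the Mathematical Operators block.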
We may also wish to consider how the user types these. Julia's vscode extension has a lovely unicode popup when you type \. I.e. turning \sum into ∑. We might want to do the same in scide for \\?
This works as expected.
Note that ∋∀ is a unique operator, not two.
Types of changes
To-do list