
I am creating a REPL for Linux commands.

Since my grammar for a command is call: WS? (redirection WS)* argument (WS atom)* WS?, the parse tree always contains whitespace nodes once parsing is done. I understand that WS has to be in the grammar to match the command line correctly, but I want to filter these nodes out after parsing.

I tried adding %ignore WS at the end of the file, but it didn't work.

  • Hello! You simply need to rename it to _WS, and Lark will filter them out automatically. Commented Nov 27, 2022 at 14:31
  • Thanks a lot! I don't know why this is the solution to my problem though, I appreciate it :) Commented Nov 28, 2022 at 16:13
  • Lark automatically removes every token that starts with _. Commented Nov 30, 2022 at 19:52

3 Answers

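You can use a Transformer and have the method for the WS token return Discard. (Returning Discard applies to Lark 1.0 and later; older versions raised it as an exception instead.)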

Transformers make it much easier to convert the result of the parsing into the format that you need for the rest of your program. Since you didn't include your grammar, and your specific use case is too complex to replicate quickly, I'll show an example using the following basic grammar:

from lark import Lark

GRAMMAR = r"""
?start: ints
ints: (INT WS*)+
%import common (INT, WS)
"""

Before defining a transformer, we can see that all ints and spaces are present in the parsed tree:

>>> Lark(GRAMMAR).parse('12 34 56')
Tree(Token('RULE', 'ints'), [Token('INT', '12'), Token('WS', ' '), Token('INT', '34'), Token('WS', ' '), Token('INT', '56')])

We can define a simple transformer that only transforms WS:

from lark import Lark, Token, Transformer, Discard

class SpaceTransformer(Transformer):
    def WS(self, tok: Token):
        return Discard

Applying it yields the same tree as before, but with the WS tokens removed:

>>> tree = Lark(GRAMMAR).parse('12 34 56')

>>> SpaceTransformer().transform(tree)
Tree(Token('RULE', 'ints'), [Token('INT', '12'), Token('INT', '34'), Token('INT', '56')])

The transformer can be expanded further to handle more of the defined tokens:

class SpaceTransformer(Transformer):
    def WS(self, tok: Token):
        return Discard

    def INT(self, tok: Token) -> int:
        return int(tok.value)

The values are now proper integers, but they are still wrapped in a Tree:

>>> tree = Lark(GRAMMAR).parse('12 34 56')

>>> SpaceTransformer().transform(tree)
Tree(Token('RULE', 'ints'), [12, 34, 56])

We can take it one step further and define a method for the rule as well. Each method in a Transformer whose name matches a token or rule is called automatically for every matching node in the parsed tree:

class SpaceTransformer(Transformer):
    def WS(self, tok: Token):
        return Discard

    def INT(self, tok: Token) -> int:
        return int(tok.value)

    def ints(self, integers):
        return integers

Now when we transform the tree, we get a list of ints instead of a tree:

>>> tree = Lark(GRAMMAR).parse('12 34 56')

>>> SpaceTransformer().transform(tree)
[12, 34, 56]

While my example used very simple types, you could define a method for your command rule that returns a Command object, or whatever you have defined to represent it. For rules that contain other rules, the outer rules will receive the already transformed objects, just like ints received int objects.

There are also some customizations you can apply to how the transformer methods receive arguments by using the v_args decorator.


Rename your WS terminal to _WS. Lark automatically omits terminals whose names start with an underscore from the parse tree.

If you import WS, change the import to use the alias name _WS:

%import common.WS -> _WS

If you define WS manually, change the name in the definition to _WS:

_WS: /[ \t\f\r\n]/+

Example:

from lark import Lark

GRAMMAR_1 = r"""
%import common.WORD
%import common.WS
?start: (WORD WS*)+
"""

GRAMMAR_2 = r"""
%import common.WORD
%import common.WS -> _WS  // NOTE: using _WS alias
?start: (WORD _WS*)+
"""

>>> Lark(GRAMMAR_1).parse('a b c')
Tree(Token('RULE', 'start'), [Token('WORD', 'a'), Token('WS', ' '), Token('WORD', 'b'), Token('WS', ' '), Token('WORD', 'c')])

>>> Lark(GRAMMAR_2).parse('a b c')
Tree(Token('RULE', 'start'), [Token('WORD', 'a'), Token('WORD', 'b'), Token('WORD', 'c')])

If your grammar is entirely whitespace-insensitive, consider ignoring whitespace everywhere with %ignore instead, as suggested in ja2142's answer.


Originally suggested by Erez in the comments.


For a language without significant whitespace (I believe a bash-like language qualifies) it should be enough to add:

%import common.WS
%ignore WS

and remove all other whitespace handling from the grammar. For example:

call: WS? (redirection WS)* argument (WS atom)* WS?

would become simply:

call: redirection* argument atom*

In a grammar like that whitespace counts as a boundary between tokens, but it doesn't have to be handled further in either grammar definition or resulting parse tree.
