
I am trying to use lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I came across is "Here Document" strings. I would describe them as multiline strings with custom delimiters, like:

$my_var .= << 'anydelim';
some things
other things
anydelim

While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.

If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!

2 Answers


A solution using a regexp. Key ingredients:

  • back references, in this case named references
  • the /s modifier (causes . to also match newlines)
  • .*? to match non-greedily (otherwise it would match past the closing delimiter)

from lark import Lark

block_grammar = r"""
    %import common.WS
    %ignore WS
    delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
minimal_parser = Lark(block_grammar, start="delimited_string")

ast = minimal_parser.parse(r"""
    << 'SomeDelim'; fasdfasdf 
    fddfsdg SomeDelim
""")
print(ast.pretty())
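
As a sanity check (not part of the grammar above), the same named-backreference pattern can be exercised with Python's re module alone; the leading "<<" is left out here because the grammar matches it as a separate literal, and heredoc_body is just a name used for this sketch:

import re

# Same pattern as in the grammar; re.S plays the role of the /s modifier.
heredoc_body = re.compile(
    r"(?P<quote>['\"])(?P<delimiter>[A-Za-z_]+)(?P=quote);.*?(?P=delimiter)",
    re.S,
)

m = heredoc_body.search("'SomeDelim'; fasdfasdf\nfddfsdg SomeDelim")
print(m.group("delimiter"))  # SomeDelim
print(m.group(0))            # full match, from the opening quote through the closing delimiter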




If you want to do complex back references across multiple terminals, i.e. when you can't use a single regex, you need to use a PostLexer (or, worst case, a custom lexer). A small example with an XML-like structure:

<html>
    <body>
        Hello World
    </body>
</html>

This could be parsed (and validated) by the following grammar plus PostLexer:

from typing import Iterator

from lark import Lark, Token

TEXT = r"""
<html>
    <body>
        Hello World
    </body>
</html>
"""

GRAMMAR = r"""
start: node

node: OPEN_TAG content* CLOSE_TAG
content: node
       | TEXT 

TEXT: /[^\s<>]+/
RAW_OPEN: "<" /\w+/ ">"
RAW_CLOSE: "</" /\w+/ ">"

%ignore WS

%import common.WS

%declare OPEN_TAG CLOSE_TAG
"""


class MatchTag:
    always_accept = "RAW_OPEN", "RAW_CLOSE"

    def process(self, stream: Iterator[Token]) -> Iterator[Token]:
        stack = []  # open tags that have not been closed yet
        for t in stream:
            if t.type == "RAW_OPEN":
                stack.append(t)
                t.type = "OPEN_TAG"
            elif t.type == "RAW_CLOSE":
                # A closing tag must match the most recently opened tag.
                open_tag = stack.pop()
                if open_tag.value[1:-1] != t.value[2:-1]:
                    raise ValueError(f"Non matching closing tag (expected {open_tag.value!r}, got {t.value!r})")
                t.type = "CLOSE_TAG"
            yield t


parser = Lark(GRAMMAR, parser='lalr', postlex=MatchTag())

print(parser.parse(TEXT).pretty())
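
To see the validation in action, here is a minimal check with a deliberately mismatched closing tag (assuming the ValueError raised in MatchTag.process propagates out of parse() unchanged):

# Malformed input: </html> closes <body>.
bad = "<html><body>Hello</html></body>"

try:
    parser.parse(bad)
except ValueError as exc:
    print(exc)  # Non matching closing tag (expected '<body>', got '</html>')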

(Note: don't use Lark if you actually want to parse XML. There are a lot of pitfalls that are hard or impossible to deal with.)

3 Comments

Thanks for the answer and introducing me to postlexers! Fortunately, I don't have to deal with nested tags. The difficulty with heredoc is to find the closing delimiter - it could be anything, since it depends on the current string. And it's not delimited itself. But it looks like your solution might work for this question: stackoverflow.com/questions/65277844/…
@user766308 I already commented on that answer. That question also contains 'self-closing' tags that look no different from normal opening tags, plus attributes. That makes it quite a bit harder.
Thanks for the clarification, I did not notice the wrinkle about self-closing tags there.
