
I am trying to use lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I came across is "Here Document" strings. I would describe them as multiline strings with custom delimiters, like:

$my_var .= << 'anydelim';
some things
other things
anydelim

While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.

If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!

2 Answers


A solution using a regexp. Key ingredients:

  • back references, in this case named references
  • the /s modifier (causes . to also match newlines)
  • .*? to match non-greedily (otherwise it would match past the closing delimiter)

from lark import Lark

block_grammar = r"""
    %import common.WS
    %ignore WS
    delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
minimal_parser = Lark(block_grammar, start="delimited_string")

ast = minimal_parser.parse(r"""
    << 'SomeDelim'; fasdfasdf 
    fddfsdg SomeDelim
""")
print(ast.pretty())
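
As a sanity check (not part of the grammar above), the same named-backreference pattern can be exercised with Python's re module alone; the leading "<<" is left out here because the grammar matches it as a separate literal, and heredoc_body is just a name used for this sketch:

import re

# Same pattern as in the grammar; re.S plays the role of the /s modifier.
heredoc_body = re.compile(
    r"(?P<quote>['\"])(?P<delimiter>[A-Za-z_]+)(?P=quote);.*?(?P=delimiter)",
    re.S,
)

m = heredoc_body.search("'SomeDelim'; fasdfasdf\nfddfsdg SomeDelim")
print(m.group("delimiter"))  # SomeDelim
print(m.group(0))            # full match, from the opening quote through the closing delimiter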




If you want to do complex back references across multiple terminals, i.e. when you can't use a single regex, you need to use a PostLexer (or, worst case, a custom lexer). A small example with an XML-like structure:

<html>
    <body>
        Hello World
    </body>
</html>

This could be parsed (and validated) by the following grammar plus PostLexer:

from typing import Iterator

from lark import Lark, Token

TEXT = r"""
<html>
    <body>
        Hello World
    </body>
</html>
"""

GRAMMAR = r"""
start: node

node: OPEN_TAG content* CLOSE_TAG
content: node
       | TEXT 

TEXT: /[^\s<>]+/
RAW_OPEN: "<" /\w+/ ">"
RAW_CLOSE: "</" /\w+/ ">"

%ignore WS

%import common.WS

%declare OPEN_TAG CLOSE_TAG
"""


class MatchTag:
    always_accept = "RAW_OPEN", "RAW_CLOSE"

    def process(self, stream: Iterator[Token]) -> Iterator[Token]:
        stack = []  # open tags that have not been closed yet
        for t in stream:
            if t.type == "RAW_OPEN":
                stack.append(t)
                t.type = "OPEN_TAG"
            elif t.type == "RAW_CLOSE":
                # A closing tag must match the most recently opened tag.
                open_tag = stack.pop()
                if open_tag.value[1:-1] != t.value[2:-1]:
                    raise ValueError(f"Non matching closing tag (expected {open_tag.value!r}, got {t.value!r})")
                t.type = "CLOSE_TAG"
            yield t


parser = Lark(GRAMMAR, parser='lalr', postlex=MatchTag())

print(parser.parse(TEXT).pretty())
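
To see the validation in action, here is a minimal check with a deliberately mismatched closing tag (assuming the ValueError raised in MatchTag.process propagates out of parse() unchanged):

# Malformed input: </html> closes <body>.
bad = "<html><body>Hello</html></body>"

try:
    parser.parse(bad)
except ValueError as exc:
    print(exc)  # Non matching closing tag (expected '<body>', got '</html>')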

(Note: don't use Lark if you actually want to parse XML. There are a lot of pitfalls that are hard or impossible to deal with.)

3 Comments

Thanks for the answer and introducing me to postlexers! Fortunately, I don't have to deal with nested tags. The difficulty with heredoc is to find the closing delimiter - it could be anything, since it depends on the current string. And it's not delimited itself. But it looks like your solution might work for this question: stackoverflow.com/questions/65277844/…
@user766308 I already commented on that answer. That question also contains 'self-closing' tags that look no different from normal opening tags, plus attributes. That makes it quite a bit harder.
Thanks for the clarification, I did not notice the wrinkle about self-closing tags there.
