LALR Grammar for transforming text to csv

Question

I have a processor trace output that has the following format:

Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000

Allegedly, this is tab-separated (\t); however, that is not the case, as inline spaces are found here and there. I want to transform this into .csv format, with a header row followed by the entries. For example:

Time,Cycle,PC,Instr,Decoded instruction,Register and memory contents
905ns,86,00000e36,00a005b3,"c.add x11, x0, x10", x11=00000e5c x10:00000e5c
915ns,87,00000e38,00000693,"c.addi x13, x0, 0", x13=00000000
...
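
As a rough illustration of that target format, the sketch below writes one already-split entry with Python's standard csv module. The field values are copied from the trace above; the splitting itself is assumed to have happened elsewhere, and the names HEADER and row are made up for this example.

```python
import csv
import io

HEADER = ["Time", "Cycle", "PC", "Instr",
          "Decoded instruction", "Register and memory contents"]

# One entry from the trace, already split into its six fields (assumed input).
row = ["905ns", "86", "00000e36", "00a005b3",
       "c.add x11, x0, x10", "x11=00000e5c x10:00000e5c"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(HEADER)
writer.writerow(row)

# Fields that contain commas are quoted automatically
# (csv.QUOTE_MINIMAL is the default quoting mode).
csv_text = buffer.getvalue()
print(csv_text)
```

Only the decoded instruction is quoted here, because it is the only field that contains the delimiter.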

To do that, I am using Lark in Python 3 (>=3.10), and I came up with the following grammar for the source format:

Lark Grammar

start: header NEWLINE entries+

# Header is expected to be 
# Time\tCycle\tPC\tInstr\tDecoded instruction\tRegister and memory contents
header: HEADER_FIELD+
         

# Entries are expected to be e.g.,
#     85ns               4 00000180 00003197 auipc             x3, 0x3000          x3=00003180
entries: TIME                \
         CYCLE               \
         PC                  \
         INSTR               \
         DECODED_INSTRUCTION \
         reg_and_mem? NEWLINE

reg_and_mem: REG_AND_MEM+ 

///////////////
// TERMINALS //
///////////////

HEADER_FIELD: /
    [a-z ]+  # Characters that are optionally separated by a single space
/xi          

TIME: /
    [\d\.]+    # One or more digits
    [smunp]s   # Time unit
/x

CYCLE: INT

PC: HEXDIGIT+

INSTR: HEXDIGIT+

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi


REG_AND_MEM: /
    (?:x[0-9]+|PA)    # Register name (xN) or PA
    [=:]              # Separator
    [0-9a-f]+         # Hexadecimal value
/xi

///////////////
// IMPORTS   //
///////////////

%import common.HEXDIGIT
%import common.NUMBER
%import common.INT
%import common.UCASE_LETTER
%import common.CNAME
%import common.WS_INLINE
%import common.WS
%import common.NEWLINE

///////////////
// IGNORE    //
///////////////

%ignore WS_INLINE
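
The REG_AND_MEM terminal can be tried on its own with Python's re module before involving Lark. The sketch below assumes the intended pattern is "xN or PA, then = or :, then a hexadecimal value"; the sample field is taken from the lw line of the trace, and findall simply skips the blanks between tokens, which only loosely mirrors the %ignore directive.

```python
import re

# Assumed intended form of the REG_AND_MEM terminal: "xN" or "PA",
# then "=" or ":", then a hexadecimal value.
REG_AND_MEM = re.compile(r"(?:x[0-9]+|PA)[=:][0-9a-f]+", re.IGNORECASE)

# Register and memory contents of the first lw line in the trace.
field = "x14=00002b20  x3:00003288  PA:00003358"

tokens = REG_AND_MEM.findall(field)
print(tokens)
```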

Here is my simple driver code:

import lark


class TraceTransformer(lark.Transformer):

    def start(self, args):
        return lark.Discard

    def header(self, fields):

        return [str(field) for field in fields]

    def entries(self, args):
        print(args)
        ...

# The grammar provided above, stored as grammar.lark
# in the same directory as this file.
with open("grammar.lark") as grammar_file:
    parser = lark.Lark(grammar=grammar_file.read(),
                       start="start",
                       parser="lalr",
                       transformer=TraceTransformer())

# This is parsed by the grammar without problems.
# Note that I omit the operand part from one c.addi
# and it is still parsed. This is OK, as some
# mnemonics do not have operands (e.g., fence).
dummy_text_ok1 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi                  x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000"""

# Now here starts trouble. Note that here we don't
# have a REG_AND_MEM part on the jump instruction.
# However this is still parsed with no errors.
dummy_text_ok2 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory
945ns              90 00000e3e 2b40006f c.jal             x0, 692
"""

# But here, when the parser meets the c.jal line,
# where there is no REG_AND_MEM part and a follow-up
# entry exists, we have an issue.
dummy_text_problematic = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000
"""

parser.parse(dummy_text_ok1) 
parser.parse(dummy_text_ok2)
parser.parse(dummy_text_problematic)

The Runtime Error

No terminal matches 'c' in the current parser context, at line 6 col 45

945ns              90 00000e3e 2b40006f c.jal             x0, 692                                        
                                         ^
Expected one of:
        * DECODED_INSTRUCTION

So this indicates that the DECODED_INSTRUCTION rule is not behaving as expected.

The Rule

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi

This rule is really heavy: it has to match the whole ISA of the processor, which is RISC-V. Step by step, I have:

The instruction mnemonic regex as a sequence of a-z characters and optional dots (.)
The optional operand part (there exist instructions in the ISA with no operands).

Now, this was tricky. Instead of accounting for every possible instruction variation in my rules above, I decided to leverage the fact that the following column (Register and memory contents) contains characters that do not exist in any instruction variation of the ISA. This is where the lookahead part of the regex comes into play. I stop when:

Either I have reached the xN= or xN: part of the field,
or I have reached the PA: part of the field,
or I have reached the end of the line ($) because the field does not exist.

However, the last case does not seem to work as intended, as shown in the above example. The way I see it, this seems OK to either stop when you meet one of the two criteria, OR you have encountered a new line (implying that the following part is omitted for the current entry). Did I blunder something in the regex part?
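
The stop criteria can be checked outside Lark with plain Python re. This is only a sketch: it compiles the same pattern with the x and i flags and applies it to three lines copied from the trace, each taken in isolation. Lark's lexer may treat the pattern differently, so this illustrates the regex itself rather than the parser.

```python
import re

# Same pattern as the DECODED_INSTRUCTION terminal, with the x and i flags.
DECODED_INSTRUCTION = re.compile(r"""
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part
    (?=                  # Stop when
        x[0-9]{1,2}[=:]  # an xN= or xN: follows,
        |PA:             # or PA: follows,
        |\s+$            # or only whitespace is left
    )
""", re.VERBOSE | re.IGNORECASE)

samples = [
    "c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c\n",
    "lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358\n",
    "c.jal             x0, 692\n",  # no register/memory part, last line of the text
]

# The raw match keeps the blanks before the lookahead target, so strip them.
decoded = [DECODED_INSTRUCTION.match(s).group(0).strip() for s in samples]
for text in decoded:
    print(repr(text))
```

All three lines match here, including c.jal, because in this sketch the c.jal line is the last line of its input.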

I don't know about this, but wouldn't you need to use the m (multiline) flag to make $ match at the line end rather than the text end? Further, you might need to add a word boundary; see this regex101 demo. Will remove this comment if I totally missed the point :D
– bobble bubble, Commented Oct 5, 2024 at 9:43
Also try to replace |\s+$ with |[\t ]+\r?\n|\Z (regex101, no m flag needed here).
– bobble bubble, Commented Oct 5, 2024 at 9:58
Indeed, I missed the m flag. But after also adding the \b at the instruction mnemonic, the above example still fails. Regarding the second response, this raises regex compilation issues in Lark (LexError: Cannot compile token). I believe that is because we are in extended (x) mode.
– ex1led, Commented Oct 5, 2024 at 10:00
Or try |\s+\r?\n|\Z instead of |\s+$ just to test if that would work (regex101).
– bobble bubble, Commented Oct 5, 2024 at 10:07
So \s+\r?\n|\Z doesn't work, BUT \s+\r\n|\Z does work. I am not really sure whether this is correct, though. I mean, I agree that the \r should be optional. That's weird.
– ex1led, Commented Oct 5, 2024 at 10:09

MegaIng · Accepted Answer · 2024-10-05 09:43:26Z

1

For $ to mean end-of-line, you need to add the m, i.e. MULTILINE flag

DECODED_INSTRUCTION: /
    ...
/xim
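
The effect of the flag can be seen with plain Python re, independently of Lark. This is a minimal sketch, assuming the pattern from the question and a two-line excerpt of the trace; the variable names are made up for the example.

```python
import re

PATTERN = r"""
    [a-z\.]+
    ([-a-z0-9, ()]+)?
    (?=
        x[0-9]{1,2}[=:]
        |PA:
        |\s+$
    )
"""

# The c.jal line (with its trailing blanks) followed by another entry.
text = (
    "c.jal             x0, 692           \n"
    "lw               x14, 208(x3)        x14=00002b20\n"
)

# Without m, `$` only anchors at the end of the whole text, so nothing matches.
without_m = re.compile(PATTERN, re.X | re.I).match(text)
# With m, `$` also anchors before every newline, so the c.jal line matches.
with_m = re.compile(PATTERN, re.X | re.I | re.M).match(text)
print(without_m)
print(repr(with_m.group(0)))

# Caveat: `\s+$` needs at least one blank before the newline. If the trailing
# blanks are stripped from the c.jal line, the match fails even with m.
stripped = "c.jal             x0, 692\nlw               x14, 208(x3)        x14=00002b20\n"
stripped_match = re.compile(PATTERN, re.X | re.I | re.M).match(stripped)
print(stripped_match)
```

The last case suggests that the fix depends on the trailing blanks being present in the input. Whether that has anything to do with the differing results reported in the comments below is not confirmed by this thread.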

answered Oct 5, 2024 at 9:43

MegaIng

6 Comments

ex1led Over a year ago

Hi there. Indeed I have missed that. But still after adding that it still complains about the same 'problematic' line with the same error.

MegaIng Over a year ago

@ex1led It works perfectly for me.

ex1led Over a year ago

In Python 3.11.9 and lark==1.2.2 the grammar with the m flag added doesn't work, i.e., the problematic line is triggering an error on the DECODED_INSTRUCTION rule.

MegaIng Over a year ago

@ex1led Are you sure you are running the exact code and grammar you posted in the question? The only modification I needed to make was to add a newline at the end of one of the two text_oks and add the m flag, and then it works (i.e. parses without an error message).

ex1led Over a year ago

Yeah... that's weird. Would you mind pasting your code, e.g., in a gist?

basit · Accepted Answer · 2024-10-05 12:51:09Z

0

The issue seems to be with the DECODED_INSTRUCTION rule's lookahead, which does not properly handle cases where the Register and memory contents part is missing. To ensure that DECODED_INSTRUCTION stops before the newline in that case, adjust the lookahead to include the end of the line (\n) or the end of the input ($). This prevents the instruction from unintentionally consuming parts of the next row. Updated DECODED_INSTRUCTION rule:

DECODED_INSTRUCTION: /
[a-z\.]+             # Instruction mnemonic
([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
(?=                  # Stop when 
    x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
    | PA:             # or you meet PA:
    | \n              # or it's the end of the line
    | $               # or the end of input
)

/xi

answered Oct 5, 2024 at 12:51

basit

1 Comment

ex1led Over a year ago

Have you tested this code with the MRWE that I provided?
