I have a processor trace output that has the following format:

Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000

Allegedly, this is tab (\t) separated. In practice it is not, because the columns are padded with inline spaces. I want to transform this into .csv format, with a header row followed by the entries. For example:

Time,Cycle,PC,Instr,Decoded instruction,Register and memory contents
905ns,86,00000e36,00a005b3,"c.add x11, x0, x10", x11=00000e5c x10:00000e5c
915ns,87,00000e38,00000693,"c.addi x13, x0, 0", x13=00000000
...
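
As a minimal sketch of the target format, the row construction could look like the following once the six fields are available as strings. It uses only the standard `csv` module; `to_csv_row` and `HEADER` are hypothetical names introduced for illustration, and collapsing the padding to single spaces is an assumption about the desired output.

```python
import csv
import io

# Column names taken from the trace header in the question.
HEADER = ["Time", "Cycle", "PC", "Instr",
          "Decoded instruction", "Register and memory contents"]


def to_csv_row(time, cycle, pc, instr, decoded, reg_and_mem=""):
    """Hypothetical helper: build one CSV row from already-split fields."""
    def squeeze(text):
        # Collapse runs of padding whitespace to a single space.
        return " ".join(text.split())
    return [time, cycle, pc, instr, squeeze(decoded), squeeze(reg_and_mem)]


buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(HEADER)
writer.writerow(to_csv_row("905ns", "86", "00000e36", "00a005b3",
                           "c.add            x11,  x0, x10",
                           "x11=00000e5c x10:00000e5c"))
csv_text = buf.getvalue()
print(csv_text)
```

`csv.writer` quotes the decoded instruction on its own because that field contains commas, so no manual quoting is needed. Unlike the hand-written example above, this sketch does not emit a space after the comma that precedes the last field.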

To do that, I am using Lark in Python 3 (>=3.10), and I came up with the following grammar for the source format:

Lark Grammar

start: header NEWLINE entries+

# Header is expected to be 
# Time\tCycle\tPC\tInstr\tDecoded instruction\tRegister and memory contents
header: HEADER_FIELD+
         

# Entries are expected to be e.g.,
#     85ns               4 00000180 00003197 auipc             x3, 0x3000          x3=00003180
entries: TIME                \
         CYCLE               \
         PC                  \
         INSTR               \
         DECODED_INSTRUCTION \
         reg_and_mem? NEWLINE

reg_and_mem: REG_AND_MEM+ 

///////////////
// TERMINALS //
///////////////

HEADER_FIELD: /
    [a-z ]+  # Characters that are optionally separated by a single space
/xi          

TIME: /
    [\d\.]+    # One or more digits
    [smunp]s   # Time unit
/x

CYCLE: INT

PC: HEXDIGIT+

INSTR: HEXDIGIT+

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi


REG_AND_MEM: /
    (?:x[0-9]+|PA)   # Register name xN, or PA
    [=:]             # Separator
    [0-9a-f]+        # Hex value
/xi

///////////////
// IMPORTS   //
///////////////

%import common.HEXDIGIT
%import common.NUMBER
%import common.INT
%import common.UCASE_LETTER
%import common.CNAME
%import common.NUMBER
%import common.WS_INLINE
%import common.WS
%import common.NEWLINE

///////////////
// IGNORE    //
///////////////

%ignore WS_INLINE

Here is my simple driver code:

import lark


class TraceTransformer(lark.Transformer):

    def start(self, args):
        return lark.Discard

    def header(self, fields):

        return [str(field) for field in fields]

    def entries(self, args):
        print(args)
        ...

                               # the grammar provided above
                               # stored in the same directory
                               # as this file
parser = lark.Lark(grammar=open("grammar.lark").read(),
                   start="start",
                   parser="lalr",
                   transformer=TraceTransformer())

# This is parsed by the grammar without problems.
# Note that I omit the operand part of the third
# c.addi and it is still parsed. This is OK, as some
# mnemonics do not have operands (e.g., fence).
dummy_text_ok1 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi                  x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000"""

# Now here starts trouble. Note that here we don't
# have a REG_AND_MEM part on the jump instruction.
# However this is still parsed with no errors.
dummy_text_ok2 = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory
945ns              90 00000e3e 2b40006f c.jal             x0, 692
"""

# But here, when the parser meets the c.jal line,
# which has no REG_AND_MEM part, and a follow-up
# entry exists, we have an issue.
dummy_text_problematic = r"""Time    Cycle   PC  Instr   Decoded instruction Register and memory contents
    905ns              86 00000e36 00a005b3 c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c
    915ns              87 00000e38 00000693 c.addi           x13,  x0, 0         x13=00000000
    925ns              88 00000e3a 00000613 c.addi           x12,  x0, 0         x12=00000000
    935ns              89 00000e3c 00000513 c.addi           x10,  x0, 0         x10=00000000
    945ns              90 00000e3e 2b40006f c.jal             x0, 692           
    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20  x3:00003288  PA:00003358
    985ns              94 000010f6 00a00333 c.add             x6,  x0, x10        x6=00000000 x10:00000000
    995ns              95 000010f8 14872783 lw               x15, 328(x14)       x15=00000000 x14:00002b20  PA:00002c68
   1015ns              97 000010fc 00079563 c.bne            x15,  x0, 10        x15:00000000
"""

parser.parse(dummy_text_ok1) 
parser.parse(dummy_text_ok2)
parser.parse(dummy_text_problematic) 

The Runtime Error

No terminal matches 'c' in the current parser context, at line 6 col 45

945ns              90 00000e3e 2b40006f c.jal             x0, 692                                        
                                         ^
Expected one of:
        * DECODED_INSTRUCTION

So this indicates that the DECODED_INSTRUCTION rule is not behaving as expected.

The Rule

DECODED_INSTRUCTION: /
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
    (?=                  # Stop when 
        x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
        |PA:             # or you meet PA:
        |\s+$            # or there is no REG_AND_MEM and you meet a \n
    )
/xi

This rule carries a lot of weight: it has to match every instruction of the processor's ISA, which is RISC-V. Step by step, it consists of:

  • The instruction mnemonic regex as a sequence of a-z characters and optional dots (.)
  • The optional operand part (there exist instructions in the ISA with no operands).

Now, this was the tricky part. Instead of accounting for every possible instruction variation in the rules above, I decided to leverage the fact that the following column (Register and memory contents) contains character sequences that do not occur in any instruction variation of the ISA. This is where the look-ahead part of the regex comes into play. I stop when:

  • I have reached the xN= or xN: part of the field,
  • or I have reached the PA: part of the field,
  • or I have reached the end of the line ($), because the field does not exist.

However, the last case does not seem to work as intended, as the example above shows. As I see it, the token should end either when one of the first two criteria is met or when a newline is encountered (implying that the following field is omitted for the current entry). Did I get something wrong in the regex?
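
To illustrate the look-ahead idea described above, here is a small sketch that compiles the same pattern directly with Python's `re` module, outside Lark, and applies it to an entry that does have a register column. The names `DECODED`, `token` and `rest` are introduced only for this illustration, and how Lark drives the regex internally is not modelled here.

```python
import re

# Same pattern as the DECODED_INSTRUCTION terminal, compiled with plain `re`.
DECODED = re.compile(r"""
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part
    (?=                  # Stop when
        x[0-9]{1,2}[=:]  # an xN= or xN: follows
        |PA:             # or PA: follows
        |\s+$            # or only whitespace remains
    )
""", re.VERBOSE | re.IGNORECASE)

line = "c.add            x11,  x0, x10       x11=00000e5c x10:00000e5c"
m = DECODED.match(line)
token = m.group(0)      # what the terminal would consume
rest = line[m.end():]   # what is left for the register column
print(repr(token))
print(repr(rest))
```

The operand part first runs greedily into `x11` of the register column, then backtracks until the look-ahead sees `x11=`, so the token ends right before the register dump (including the padding spaces in front of it).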

  • I don't know about this, but wouldn't you need the m (multiline) flag to make $ match at the end of a line rather than at the end of the text? You might also need to add a word boundary, see this regex101 demo. Will remove this comment if I totally missed the point :D Commented Oct 5, 2024 at 9:43
  • Also try replacing |\s+$ with |[\t ]+\r?\n|\Z (regex101, no m flag needed here). Commented Oct 5, 2024 at 9:58
  • Indeed, I missed the m flag. But even after also adding \b to the instruction mnemonic, the above example still fails. As for the second suggestion, it raises a regex compilation error in Lark: LexError: Cannot compile token. I believe that is because we are in extended (x) mode. Commented Oct 5, 2024 at 10:00
  • Or try |\s+\r?\n|\Z instead of |\s+$ just to test whether that would work (regex101). Commented Oct 5, 2024 at 10:07
  • So \s+\r?\n|\Z doesn't work, BUT \s+\r\n|\Z does. I am not really sure whether this is correct, though; I agree that the \r should be optional. That's weird. Commented Oct 5, 2024 at 10:09

2 Answers

1

For $ to mean end-of-line, you need to add the m (MULTILINE) flag:

DECODED_INSTRUCTION: /
    ...
/xim
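
The effect of the m flag can be checked outside Lark with a short sketch that uses only the standard `re` module. The names `BODY`, `INSTR`, `padded`, `unpadded` and `variant` are made up for this illustration, and the `[ \t]*$` variant at the end is an assumption that has not been tried inside a Lark grammar.

```python
import re

# The DECODED_INSTRUCTION pattern, condensed; whitespace is ignored in x mode.
BODY = r"""
    [a-z\.]+
    ([-a-z0-9, ()]+)?
    (?= x[0-9]{1,2}[=:] | PA: | \s+$ )
"""
no_m = re.compile(BODY, re.VERBOSE | re.IGNORECASE)
with_m = re.compile(BODY, re.VERBOSE | re.IGNORECASE | re.MULTILINE)

INSTR = "c.jal             x0, 692"
NEXT = "    975ns              93 000010f2 0d01a703 lw               x14, 208(x3)        x14=00002b20\n"
padded = INSTR + "           \n" + NEXT   # trailing spaces before the newline
unpadded = INSTR + "\n" + NEXT            # newline right after the operands

print(no_m.match(padded))      # None: without m, $ only matches at the very end
print(with_m.match(padded))    # matches: one trailing space is left for \s+
print(with_m.match(unpadded))  # None: \s+ needs whitespace *before* the line end

# Assumed variant: allow zero trailing blanks before the end of the line.
variant = re.compile(BODY.replace(r"\s+$", r"[ \t]*$"),
                     re.VERBOSE | re.IGNORECASE | re.MULTILINE)
print(variant.match(unpadded))  # matches
```

In plain `re`, then, the m flag helps only when the c.jal line keeps its trailing spaces. If those spaces are stripped (for example by an editor or by copy and paste), `\s+$` has nothing to match before the newline. This might be one reason the same grammar behaves differently for different people, though that is a guess.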

6 Comments

Hi there. Indeed, I missed that. But even after adding it, the parser still complains about the same 'problematic' line with the same error.
@ex1led It works perfectly for me.
In Python 3.11.9 and lark==1.2.2, the grammar with the m flag added doesn't work, i.e., the problematic line still triggers an error on the DECODED_INSTRUCTION rule.
@ex1led Are you sure you are running the exact code and grammar you posted in the question? The only modifications I needed to make were to add a newline at the end of one of the two text_oks and to add the m flag, and then it works (i.e., it parses without an error message).
Yeah... that's weird. Would you mind pasting your code, e.g., in a gist?
0

The issue seems to be that the lookahead in the DECODED_INSTRUCTION rule does not properly handle cases where the Register and memory contents part is missing. To ensure that DECODED_INSTRUCTION stops before the newline in that case, adjust the lookahead to include the end of the line (\n) or the end of the input ($). This prevents the instruction from unintentionally consuming parts of the next row. Updated DECODED_INSTRUCTION rule:

DECODED_INSTRUCTION: /
[a-z\.]+             # Instruction mnemonic
([-a-z0-9, ()]+)?    # Optional operand part (rd,rs1,rs2, etc.)      
(?=                  # Stop when 
    x[0-9]{1,2}[=:]  # Either you hit an xN= or xN:
    | PA:             # or you meet PA:
    | \n              # or it's the end of the line
    | $               # or the end of input
)

/xi
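
As a rough check of this rule, the sketch below compiles it with Python's `re` module and applies it to two consecutive, shortened entries. It has only been reasoned through against plain `re`, not inside Lark, so it says nothing about how the Lark lexer treats the rule; `ANSWER_RULE`, `two_lines`, `first` and `second` are names made up for the illustration.

```python
import re

# The rule from this answer, compiled with plain `re` (no m flag).
ANSWER_RULE = re.compile(r"""
    [a-z\.]+             # Instruction mnemonic
    ([-a-z0-9, ()]+)?    # Optional operand part
    (?=                  # Stop when
        x[0-9]{1,2}[=:]  # an xN= or xN: follows
        | PA:            # or PA: follows
        | \n             # or a newline follows
        | $              # or the input ends
    )
""", re.VERBOSE | re.IGNORECASE)

# Two shortened entries: only the instruction and register columns are kept.
two_lines = ("c.jal             x0, 692           \n"
             "lw               x14, 208(x3)        x14=00002b20\n")

first = ANSWER_RULE.match(two_lines)
# Continue right after the newline that ended the first entry.
second = ANSWER_RULE.match(two_lines, first.end() + 1)
print(repr(first.group(0)))
print(repr(second.group(0)))
```

Here the first token stops at the newline of the c.jal line instead of running into the next entry, and the second token stops before `x14=`.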

1 Comment

Have you tested this code with the minimal reproducible example that I provided?
