|
| 1 | +.. _compiler: |
| 2 | + |
| 3 | +The Compiler |
| 4 | +============ |
| 5 | + |
| 6 | +The compilation process in MicroPython involves the following steps: |
| 7 | + |
| 8 | +* The lexer converts the stream of text that makes up a MicroPython program into tokens. |
| 9 | +* The parser then converts the tokens into an abstract syntax tree (parse tree). |
| 10 | +* Then bytecode or native code is emitted based on the parse tree. |
| 11 | + |
| 12 | +For purposes of this discussion we are going to add a simple language feature ``add1`` |
| 13 | +that can be used in Python as: |
| 14 | + |
| 15 | +.. code-block:: pycon |
| 16 | +
|
| 17 | + >>> add1 3 |
| 18 | + 4 |
| 19 | + >>> |
| 20 | +
|
| 21 | +The ``add1`` statement takes an integer as argument and adds ``1`` to it. |
| 22 | + |
| 23 | +Adding a grammar rule |
| 24 | +---------------------- |
| 25 | + |
| 26 | +MicroPython's grammar is based on the `CPython grammar <https://docs.python.org/3.5/reference/grammar.html>`_ |
| 27 | +and is defined in `py/grammar.h <https://github.com/micropython/micropython/blob/master/py/grammar.h>`_. |
| 28 | +This grammar is what is used to parse MicroPython source files. |
| 29 | + |
| 30 | +There are two macros you need to know to define a grammar rule: ``DEF_RULE`` and ``DEF_RULE_NC``. |
| 31 | +``DEF_RULE`` allows you to define a rule with an associated compile function, |
| 32 | +while ``DEF_RULE_NC`` has no compile (NC) function for it. |
| 33 | + |
| 34 | +A simple grammar definition with a compile function for our new ``add1`` statement |
| 35 | +looks like the following: |
| 36 | + |
| 37 | +.. code-block:: c |
| 38 | +
|
| 39 | + DEF_RULE(add1_stmt, c(add1_stmt), and(2), tok(KW_ADD1), rule(testlist)) |
| 40 | +
|
| 41 | +The second argument ``c(add1_stmt)`` is the corresponding compile function that should be implemented |
| 42 | +in ``py/compile.c`` to turn this rule into executable code. |
| 43 | + |
| 44 | +The third required argument is either ``or`` or ``and``, and specifies the number of nodes associated |
| 45 | +with a statement. In this case, our ``add1`` statement is similar to ADD1 in assembly |
| 46 | +language: it takes one numeric argument. Therefore, ``add1_stmt`` has two nodes associated with it. |
| 47 | +One node is for the statement itself, i.e. the literal ``add1`` corresponding to ``KW_ADD1``, |
| 48 | +and the other is for its argument, a ``testlist`` rule, which is the top-level expression rule. |
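
As a rough illustration of what an ``and(2)`` rule expresses, the following minimal sketch matches a keyword token followed by a stand-in for ``testlist``. It is not MicroPython's parser: the names ``tok_t``, ``match_kw_add1``, ``match_testlist`` and ``match_add1_stmt`` are hypothetical, and the ``testlist`` stand-in accepts only a single integer literal.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch only. */
typedef enum { T_KW_ADD1, T_INTEGER, T_END } tok_t;

/* Each element of a rule tries to match at *pos and advances it on success. */
typedef bool (*match_fn)(const tok_t *toks, size_t *pos);

/* Stand-in for tok(KW_ADD1): matches the literal keyword. */
static bool match_kw_add1(const tok_t *toks, size_t *pos) {
    if (toks[*pos] != T_KW_ADD1) {
        return false;
    }
    ++*pos;
    return true;
}

/* Stand-in for rule(testlist): here it accepts one integer literal only. */
static bool match_testlist(const tok_t *toks, size_t *pos) {
    if (toks[*pos] != T_INTEGER) {
        return false;
    }
    ++*pos;
    return true;
}

/* and(2): both elements must match in order, otherwise nothing is consumed. */
static bool match_add1_stmt(const tok_t *toks, size_t *pos) {
    static const match_fn elems[2] = { match_kw_add1, match_testlist };
    size_t start = *pos;
    for (size_t i = 0; i < 2; ++i) {
        if (!elems[i](toks, pos)) {
            *pos = start;
            return false;
        }
    }
    return true;
}
```

The token array is expected to end with ``T_END`` so that a failed match never reads past the input.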
| 49 | + |
| 50 | +.. note:: |
| 51 | + The ``add1`` rule here is just an example and not part of the standard |
| 52 | + MicroPython grammar. |
| 53 | + |
| 54 | +The fourth argument in this example is the token associated with the rule, ``KW_ADD1``. This token should be |
| 55 | +defined in the lexer by editing ``py/lexer.h``. |
| 56 | + |
| 57 | +Defining the same rule without a compile function is achieved by using the ``DEF_RULE_NC`` macro |
| 58 | +and omitting the compile function argument: |
| 59 | + |
| 60 | +.. code-block:: c |
| 61 | +
|
| 62 | + DEF_RULE_NC(add1_stmt, and(2), tok(KW_ADD1), rule(testlist)) |
| 63 | +
|
| 64 | +The remaining arguments take on the same meaning. A rule without a compile function must |
| 65 | +be handled explicitly by all rules that may have this rule as a node. Such NC-rules are usually |
| 66 | +used to express sub-parts of a complicated grammar structure that cannot be expressed in a |
| 67 | +single rule. |
| 68 | + |
| 69 | +.. note:: |
| 70 | + The macros ``DEF_RULE`` and ``DEF_RULE_NC`` take other arguments. For an in-depth understanding of |
| 71 | + supported parameters, see `py/grammar.h <https://github.com/micropython/micropython/blob/master/py/grammar.h>`_. |
| 72 | + |
| 73 | +Adding a lexical token |
| 74 | +---------------------- |
| 75 | + |
| 76 | +Every rule defined in the grammar should have a token associated with it that is defined in ``py/lexer.h``. |
| 77 | +Add this token by editing the ``_mp_token_kind_t`` enum: |
| 78 | + |
| 79 | +.. code-block:: c |
| 80 | + :emphasize-lines: 11 |
| 81 | +
|
| 82 | + typedef enum _mp_token_kind_t { |
| 83 | + ... |
| 84 | + MP_TOKEN_KW_OR, |
| 85 | + MP_TOKEN_KW_PASS, |
| 86 | + MP_TOKEN_KW_RAISE, |
| 87 | + MP_TOKEN_KW_RETURN, |
| 88 | + MP_TOKEN_KW_TRY, |
| 89 | + MP_TOKEN_KW_WHILE, |
| 90 | + MP_TOKEN_KW_WITH, |
| 91 | + MP_TOKEN_KW_YIELD, |
| 92 | + MP_TOKEN_KW_ADD1, |
| 93 | + ... |
| 94 | + } mp_token_kind_t; |
| 95 | +
|
| 96 | +Then also edit ``py/lexer.c`` to add the new keyword literal text: |
| 97 | + |
| 98 | +.. code-block:: c |
| 99 | + :emphasize-lines: 11 |
| 100 | +
|
| 101 | + STATIC const char *const tok_kw[] = { |
| 102 | + ... |
| 103 | + "or", |
| 104 | + "pass", |
| 105 | + "raise", |
| 106 | + "return", |
| 107 | + "try", |
| 108 | + "while", |
| 109 | + "with", |
| 110 | + "yield", |
| 111 | + "add1", |
| 112 | + ... |
| 113 | + }; |
| 114 | +
|
| 115 | +The keyword text can be whatever you want it to be, but for consistency follow the existing |
| 116 | +naming convention. |
| 117 | + |
| 118 | +.. note:: |
| 119 | + The order of these keywords in ``py/lexer.c`` must match the order of tokens in the enum |
| 120 | + defined in ``py/lexer.h``. |
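
One plausible reason the two orders must agree is that a lexer can turn a keyword's index in the text table directly into a token kind by adding an offset. The sketch below shows that idea under this assumption. The names ``tok_kind_t``, ``kw_text`` and ``classify`` are hypothetical and are not the identifiers used in ``py/lexer.c``.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; only the ordering idea mirrors py/lexer.h. */
typedef enum {
    TOK_NAME,     /* any identifier that is not a keyword */
    TOK_KW_WITH,  /* first keyword token in this sketch */
    TOK_KW_YIELD,
    TOK_KW_ADD1,
} tok_kind_t;

/* Keyword text, listed in the same order as the TOK_KW_* entries above. */
static const char *const kw_text[] = { "with", "yield", "add1" };

/* Map an identifier to its token kind, using the table index as an offset. */
static tok_kind_t classify(const char *ident) {
    for (size_t i = 0; i < sizeof(kw_text) / sizeof(kw_text[0]); ++i) {
        if (strcmp(ident, kw_text[i]) == 0) {
            return (tok_kind_t)(TOK_KW_WITH + i);
        }
    }
    return TOK_NAME;
}
```

If ``"add1"`` were inserted at a different position in ``kw_text`` than ``TOK_KW_ADD1`` has in the enum, ``classify`` would silently return the wrong token kind for it and for every keyword that follows.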
| 121 | + |
| 122 | +Parsing |
| 123 | +------- |
| 124 | + |
| 125 | +In the parsing stage, the parser takes the tokens produced by the lexer and converts them to an abstract syntax tree (AST), also called a |
| 126 | +*parse tree*. The parser is implemented in `py/parse.c <https://github.com/micropython/micropython/blob/master/py/parse.c>`_. |
| 127 | + |
| 128 | +The parser also maintains a table of constants for use in different aspects of parsing, similar to what a |
| 129 | +`symbol table <https://steemit.com/programming/@drifter1/writing-a-simple-compiler-on-my-own-symbol-table-basic-structure>`_ |
| 130 | +does. |
| 131 | + |
| 132 | +Several optimizations are performed during this phase, including `constant folding <http://compileroptimizations.com/category/constant_folding.htm>`_ |
| 133 | +on integers for most operations (logical, binary, unary, etc.), optimizations on parentheses |
| 134 | +around expressions, and some optimizations on strings. |
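
To make the idea of constant folding concrete, here is a minimal sketch on a toy expression tree. It is not the code in ``py/parse.c``: the ``node_t`` type and the ``fold`` function are hypothetical, only addition and multiplication are handled, and integer overflow is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node kinds for a toy expression tree. */
typedef enum { NODE_INT, NODE_ADD, NODE_MUL, NODE_NAME } node_kind_t;

typedef struct node {
    node_kind_t kind;
    long value;              /* valid when kind == NODE_INT */
    struct node *lhs, *rhs;  /* valid for NODE_ADD and NODE_MUL */
} node_t;

/* Fold in place; returns true if the node is now an integer constant. */
static bool fold(node_t *n) {
    switch (n->kind) {
        case NODE_INT:
            return true;
        case NODE_NAME:
            return false;
        default:
            break;
    }
    /* Fold both children first, even if only one of them is constant. */
    bool lhs_const = fold(n->lhs);
    bool rhs_const = fold(n->rhs);
    if (!(lhs_const && rhs_const)) {
        return false;
    }
    long a = n->lhs->value;
    long b = n->rhs->value;
    n->value = (n->kind == NODE_ADD) ? a + b : a * b;
    n->kind = NODE_INT;
    n->lhs = n->rhs = NULL;
    return true;
}
```

With this sketch, a tree for ``2 + 3 * 4`` collapses into a single integer node, while a tree for ``x + (1 + 2)`` keeps its top-level addition and only the constant subtree is replaced.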
| 135 | + |
| 136 | +It's worth noting that *docstrings* are discarded and not accessible to the compiler. |
| 137 | +Even optimizations like `string interning <https://en.wikipedia.org/wiki/String_interning>`_ are |
| 138 | +not applied to *docstrings*. |
| 139 | + |
| 140 | +Compiler passes |
| 141 | +--------------- |
| 142 | + |
| 143 | +Like many compilers, MicroPython compiles all code to MicroPython bytecode or native code. The functionality |
| 144 | +that achieves this is implemented in `py/compile.c <https://github.com/micropython/micropython/blob/master/py/compile.c>`_. |
| 145 | +The most relevant function you should know about is this: |
| 146 | + |
| 147 | +.. code-block:: c |
| 148 | +
|
| 149 | + mp_obj_t mp_compile(mp_parse_tree_t *parse_tree, qstr source_file, bool is_repl) { |
| 150 | + // Compile the input parse_tree to a raw-code structure. |
| 151 | + mp_raw_code_t *rc = mp_compile_to_raw_code(parse_tree, source_file, is_repl); |
| 152 | + // Create and return a function object that executes the outer module. |
| 153 | + return mp_make_function_from_raw_code(rc, MP_OBJ_NULL, MP_OBJ_NULL); |
| 154 | + } |
| 155 | +
|
| 156 | +The compiler compiles the code in four passes: scope, stack size, code size and emit. |
| 157 | +Each pass runs the same C code over the same AST data structure, with different things |
| 158 | +being computed each time based on the results of the previous pass. |
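
The reason for running the same code several times can be sketched with a toy emitter that has only two passes. This is an illustration under simplifying assumptions, not the code in ``py/compile.c``: the opcodes, ``emitter_t`` and all function names are hypothetical. The first pass only counts bytes and records where a label lands. The second pass uses that recorded offset to fill in a forward jump.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and pass names, for illustration only. */
enum { OP_NOP = 0, OP_JUMP = 1, OP_RETURN = 2 };
typedef enum { PASS_CODE_SIZE, PASS_EMIT } pass_t;

typedef struct {
    pass_t pass;
    size_t offset;        /* bytes produced so far in the current pass */
    size_t label_offset;  /* where the label landed in the latest pass */
    uint8_t code[16];
} emitter_t;

/* Count the byte on every pass, but only store it on the emit pass. */
static void emit_byte(emitter_t *e, uint8_t b) {
    if (e->pass == PASS_EMIT && e->offset < sizeof(e->code)) {
        e->code[e->offset] = b;
    }
    e->offset++;
}

/* A forward jump uses the label offset recorded by the previous pass. */
static void emit_jump(emitter_t *e) {
    emit_byte(e, OP_JUMP);
    emit_byte(e, (uint8_t)e->label_offset);
}

static void emit_label(emitter_t *e) {
    e->label_offset = e->offset;
}

/* The same "compile" code runs unchanged on every pass. */
static void compile_body(emitter_t *e) {
    emit_jump(e);
    emit_byte(e, OP_NOP);
    emit_byte(e, OP_NOP);
    emit_label(e);
    emit_byte(e, OP_RETURN);
}

static size_t compile_all_passes(emitter_t *e) {
    e->label_offset = 0;
    for (int p = PASS_CODE_SIZE; p <= PASS_EMIT; ++p) {
        e->pass = (pass_t)p;
        e->offset = 0;
        compile_body(e);
    }
    return e->offset;
}
```

In this sketch the jump operand is only correct on the emit pass because the code layout is identical on both passes. If the size of any instruction changed between passes, the recorded label offset would point at the wrong byte.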
| 159 | + |
| 160 | +First pass |
| 161 | +~~~~~~~~~~ |
| 162 | + |
| 163 | +In the first pass, the compiler learns about the known identifiers (variables) and |
| 164 | +their scope, being global, local, closed over, etc. In the same pass the emitter |
| 165 | +(bytecode or native code) also computes the number of labels needed for the emitted |
| 166 | +code. |
| 167 | + |
| 168 | +.. code-block:: c |
| 169 | +
|
| 170 | + // Compile pass 1. |
| 171 | + comp->emit = emit_bc; |
| 172 | + comp->emit_method_table = &emit_bc_method_table; |
| 173 | +
|
| 174 | + uint max_num_labels = 0; |
| 175 | + for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) { |
| 176 | + if (s->emit_options == MP_EMIT_OPT_ASM) { |
| 177 | + compile_scope_inline_asm(comp, s, MP_PASS_SCOPE); |
| 178 | + } else { |
| 179 | + compile_scope(comp, s, MP_PASS_SCOPE); |
| 180 | +
|
| 181 | + // Check if any implicitly declared variables should be closed over. |
| 182 | + for (size_t i = 0; i < s->id_info_len; ++i) { |
| 183 | + id_info_t *id = &s->id_info[i]; |
| 184 | + if (id->kind == ID_INFO_KIND_GLOBAL_IMPLICIT) { |
| 185 | + scope_check_to_close_over(s, id); |
| 186 | + } |
| 187 | + } |
| 188 | + } |
| 189 | + ... |
| 190 | + } |
| 191 | +
|
| 192 | +Second and third passes |
| 193 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 194 | + |
| 195 | +The second and third passes involve computing the Python stack size and code size |
| 196 | +for the bytecode or native code. After the third pass the code size cannot change, |
| 197 | +otherwise jump labels will be incorrect. |
| 198 | + |
| 199 | +.. code-block:: c |
| 200 | +
|
| 201 | + for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) { |
| 202 | + ... |
| 203 | +
|
| 204 | + // Pass 2: Compute the Python stack size. |
| 205 | + compile_scope(comp, s, MP_PASS_STACK_SIZE); |
| 206 | +
|
| 207 | + // Pass 3: Compute the code size. |
| 208 | + if (comp->compile_error == MP_OBJ_NULL) { |
| 209 | + compile_scope(comp, s, MP_PASS_CODE_SIZE); |
| 210 | + } |
| 211 | +
|
| 212 | + ... |
| 213 | + } |
| 214 | +
|
| 215 | +Just before pass two, the compiler selects the type of code to be emitted, which can |
| 216 | +be either native code or bytecode. |
| 217 | + |
| 218 | +.. code-block:: c |
| 219 | +
|
| 220 | + // Choose the emitter type. |
| 221 | + switch (s->emit_options) { |
| 222 | + case MP_EMIT_OPT_NATIVE_PYTHON: |
| 223 | + case MP_EMIT_OPT_VIPER: |
| 224 | + if (emit_native == NULL) { |
| 225 | + emit_native = NATIVE_EMITTER(new)(&comp->compile_error, &comp->next_label, max_num_labels); |
| 226 | + } |
| 227 | + comp->emit_method_table = NATIVE_EMITTER_TABLE; |
| 228 | + comp->emit = emit_native; |
| 229 | + break; |
| 230 | +
|
| 231 | + default: |
| 232 | + comp->emit = emit_bc; |
| 233 | + comp->emit_method_table = &emit_bc_method_table; |
| 234 | + break; |
| 235 | + } |
| 236 | +
|
| 237 | +Bytecode is the default. For the native code option, note that there is an additional |
| 238 | +variant, selected via ``VIPER``. See the |
| 239 | +:ref:`Emitting native code <emitting_native_code>` section for more details on |
| 240 | +viper annotations. |
| 241 | + |
| 242 | +There is also support for *inline assembly code*, where assembly instructions are |
| 243 | +written as Python function calls but are emitted directly as the corresponding |
| 244 | +machine code. This assembler has only three passes (scope, code size, emit) |
| 245 | +and uses a different implementation, not the ``compile_scope`` function. |
| 246 | +See the `inline assembler tutorial <https://docs.micropython.org/en/latest/pyboard/tutorial/assembler.html#pyboard-tutorial-assembler>`_ |
| 247 | +for more details. |
| 248 | + |
| 249 | +Fourth pass |
| 250 | +~~~~~~~~~~~ |
| 251 | + |
| 252 | +The fourth pass emits the final code that can be executed, either bytecode in |
| 253 | +the virtual machine, or native code directly by the CPU. |
| 254 | + |
| 255 | +.. code-block:: c |
| 256 | +
|
| 257 | + for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) { |
| 258 | + ... |
| 259 | +
|
| 260 | + // Pass 4: Emit the compiled bytecode or native code. |
| 261 | + if (comp->compile_error == MP_OBJ_NULL) { |
| 262 | + compile_scope(comp, s, MP_PASS_EMIT); |
| 263 | + } |
| 264 | + } |
| 265 | +
|
| 266 | +Emitting bytecode |
| 267 | +----------------- |
| 268 | + |
| 269 | +Statements in Python code usually correspond to emitted bytecode, for example ``a + b`` |
| 270 | +generates "push a" then "push b" then "binary op add". Some statements do not emit |
| 271 | +anything but instead affect other things like the scope of variables, for example |
| 272 | +``global a``. |
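
The "push a", "push b", "binary op add" sequence can be pictured with a toy stack machine. The sketch below is not the MicroPython virtual machine: ``opcode_t``, ``instr_t`` and ``run`` are hypothetical, operands are plain integers rather than variable lookups, and there is no bounds checking on the stack.

```c
#include <stddef.h>

/* Hypothetical two-instruction bytecode for a toy stack machine. */
typedef enum { OP_PUSH, OP_BINARY_ADD } opcode_t;

typedef struct {
    opcode_t op;
    long arg;  /* used by OP_PUSH only */
} instr_t;

/* Execute the instructions and return the value left on top of the stack. */
static long run(const instr_t *code, size_t n) {
    long stack[8];
    size_t sp = 0;
    for (size_t i = 0; i < n; ++i) {
        switch (code[i].op) {
            case OP_PUSH:
                stack[sp++] = code[i].arg;
                break;
            case OP_BINARY_ADD: {
                long rhs = stack[--sp];
                long lhs = stack[--sp];
                stack[sp++] = lhs + rhs;
                break;
            }
        }
    }
    return stack[sp - 1];
}
```

Running the three instructions "push 3", "push 1", "binary add" through ``run`` leaves their sum on the stack, which is the same shape of computation as the ``add1 3`` example at the start of this document.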
| 273 | + |
| 274 | +The implementation of a function that emits bytecode looks similar to this: |
| 275 | + |
| 276 | +.. code-block:: c |
| 277 | +
|
| 278 | + void mp_emit_bc_unary_op(emit_t *emit, mp_unary_op_t op) { |
| 279 | + emit_write_bytecode_byte(emit, 0, MP_BC_UNARY_OP_MULTI + op); |
| 280 | + } |
| 281 | +
|
| 282 | +We use unary operator expressions as an example here, but the implementation |
| 283 | +details are similar for other statements/expressions. The function ``emit_write_bytecode_byte()`` |
| 284 | +is a wrapper around the main function ``emit_get_cur_to_write_bytecode()`` that all |
| 285 | +functions must call to emit bytecode. |
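
The expression ``MP_BC_UNARY_OP_MULTI + op`` packs the operator into the opcode byte itself, so no separate operand byte is needed. A minimal sketch of that encoding follows. The base value ``0xd0`` and the operator list are assumptions made for illustration and should be checked against ``py/bc0.h`` and ``py/runtime0.h``. ``encode_unary`` and ``decode_unary`` are hypothetical helpers.

```c
#include <stdint.h>

/* Assumed base opcode; the real value is defined by MicroPython's headers. */
enum { BC_UNARY_OP_MULTI = 0xd0 };

/* Hypothetical operator numbering, starting at zero. */
typedef enum {
    UNARY_OP_POSITIVE,
    UNARY_OP_NEGATIVE,
    UNARY_OP_INVERT,
    UNARY_OP_NOT,
} unary_op_t;

/* Encode: one byte carries both "this is a unary op" and which one. */
static uint8_t encode_unary(unary_op_t op) {
    return (uint8_t)(BC_UNARY_OP_MULTI + op);
}

/* Decode: subtract the base to recover the operator. */
static unary_op_t decode_unary(uint8_t byte) {
    return (unary_op_t)(byte - BC_UNARY_OP_MULTI);
}
```

This scheme works only while the operator values stay within the range of opcodes reserved after the base value.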
| 286 | + |
| 287 | +.. _emitting_native_code: |
| 288 | + |
| 289 | +Emitting native code |
| 290 | +--------------------- |
| 291 | + |
| 292 | +Similar to how bytecode is generated, there should be a corresponding function in ``py/emitnative.c`` for each |
| 293 | +code statement: |
| 294 | + |
| 295 | +.. code-block:: c |
| 296 | +
|
| 297 | + STATIC void emit_native_unary_op(emit_t *emit, mp_unary_op_t op) { |
| 298 | + vtype_kind_t vtype; |
| 299 | + emit_pre_pop_reg(emit, &vtype, REG_ARG_2); |
| 300 | + if (vtype == VTYPE_PYOBJ) { |
| 301 | + emit_call_with_imm_arg(emit, MP_F_UNARY_OP, op, REG_ARG_1); |
| 302 | + emit_post_push_reg(emit, VTYPE_PYOBJ, REG_RET); |
| 303 | + } else { |
| 304 | + adjust_stack(emit, 1); |
| 305 | + EMIT_NATIVE_VIPER_TYPE_ERROR(emit, |
| 306 | + MP_ERROR_TEXT("unary op %q not implemented"), mp_unary_op_method_name[op]); |
| 307 | + } |
| 308 | + } |
| 309 | +
|
| 310 | +The difference here is that we have to handle *viper typing*. Viper annotations allow |
| 311 | +us to handle more than one type of variable. By default all variables are Python objects, |
| 312 | +but with viper a variable can also be declared as a machine-typed variable like a native |
| 313 | +integer or pointer. Viper can be thought of as a superset of Python, where normal Python |
| 314 | +objects are handled as usual, while native machine variables are handled in an optimised |
| 315 | +way by using direct machine instructions for the operations. Viper typing may break |
| 316 | +Python equivalence because, for example, integers become native integers and can overflow |
| 317 | +(unlike Python integers which extend automatically to arbitrary precision). |
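
The overflow difference can be sketched in plain C. This sketch assumes a 32-bit native integer, which is only one possible width and depends on the target. ``native_add`` and ``wide_add`` are hypothetical helpers, with ``wide_add`` standing in for Python's arbitrary-precision behaviour by using a wider type.

```c
#include <stdint.h>

/*
 * Native-style addition: wraps modulo 2^32. The sum is computed on
 * unsigned values because signed overflow is undefined behaviour in C.
 * Converting the result back to int32_t is implementation-defined before
 * C23, and wraps on two's complement targets.
 */
static int32_t native_add(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Stand-in for a Python integer: the result is widened, so it cannot wrap here. */
static int64_t wide_add(int32_t a, int32_t b) {
    return (int64_t)a + (int64_t)b;
}
```

Adding ``1`` to ``INT32_MAX`` with ``native_add`` wraps around to ``INT32_MIN``, whereas ``wide_add`` yields the mathematically expected value.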