Commit 4eaebc1

nanjekyejoannah authored and dpgeorge committed
docs/develop: Add MicroPython Internals chapter.
This commit adds many new sections to the existing "Developing and building MicroPython" chapter to make it all about the internals of MicroPython. This work was done as part of Google's Season of Docs 2020.
1 parent 203e1d2 commit 4eaebc1

15 files changed

+1454
-11
lines changed

docs/develop/compiler.rst

Lines changed: 317 additions & 0 deletions
@@ -0,0 +1,317 @@
1+
.. _compiler:
2+
3+
The Compiler
4+
============
5+
6+
The compilation process in MicroPython involves the following steps:
7+
8+
* The lexer converts the stream of text that makes up a MicroPython program into tokens.
9+
* The parser then converts the tokens into an abstract syntax tree (parse tree).
10+
* Then bytecode or native code is emitted based on the parse tree.
11+
12+
For purposes of this discussion we are going to add a simple language feature ``add1``
13+
that can be used in Python as:
14+
15+
.. code-block:: bash
16+
17+
>>> add1 3
18+
4
19+
>>>
20+
21+
The ``add1`` statement takes an integer as argument and adds ``1`` to it.
22+
23+
Adding a grammar rule
24+
----------------------
25+
26+
MicroPython's grammar is based on the `CPython grammar <https://docs.python.org/3.5/reference/grammar.html>`_
27+
and is defined in `py/grammar.h <https://github.com/micropython/micropython/blob/master/py/grammar.h>`_.
28+
This grammar is what is used to parse MicroPython source files.
29+
30+
There are two macros you need to know to define a grammar rule: ``DEF_RULE`` and ``DEF_RULE_NC``.
31+
``DEF_RULE`` allows you to define a rule with an associated compile function,
32+
while ``DEF_RULE_NC`` has no compile (NC) function for it.
33+
34+
A simple grammar definition with a compile function for our new ``add1`` statement
35+
looks like the following:
36+
37+
.. code-block:: c
38+
39+
DEF_RULE(add1_stmt, c(add1_stmt), and(2), tok(KW_ADD1), rule(testlist))
40+
41+
The second argument ``c(add1_stmt)`` is the corresponding compile function that should be implemented
42+
in ``py/compile.c`` to turn this rule into executable code.
43+
44+
The third argument can be ``or`` or ``and``, followed by the number of nodes in the rule: ``and(n)``
45+
matches a sequence of ``n`` nodes, while ``or(n)`` matches one of ``n`` alternatives. In this case, our
46+
``add1`` statement is similar to ADD1 in assembly language: it takes one numeric argument. The ``add1_stmt``
47+
rule therefore has two nodes. One node is for the statement itself, i.e. the literal ``add1`` corresponding
48+
to ``KW_ADD1``, and the other is for its argument, a ``testlist`` rule, which is the top-level expression rule.
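The two-node shape of this rule can be sketched in plain C. This is a toy model and not the MicroPython parser: every ``toy_`` name is hypothetical, and the ``testlist`` argument is simplified to a single integer token.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical token kinds, used only in this sketch.
typedef enum { TOY_TOK_KW_ADD1, TOY_TOK_INTEGER } toy_tok_kind_t;

typedef struct {
    toy_tok_kind_t kind;
    int value; // Only meaningful for TOY_TOK_INTEGER.
} toy_token_t;

// The two nodes produced by matching the toy add1_stmt rule.
typedef struct {
    toy_tok_kind_t keyword; // Node 1: the add1 keyword.
    int argument;           // Node 2: the argument (a plain integer here).
} toy_add1_stmt_t;

// Match "KW_ADD1 INTEGER" as a sequence: both parts must be present, in order.
// Returns true and fills *out on success, false otherwise.
bool toy_parse_add1_stmt(const toy_token_t *toks, size_t n, toy_add1_stmt_t *out) {
    if (n < 2) {
        return false;
    }
    if (toks[0].kind != TOY_TOK_KW_ADD1 || toks[1].kind != TOY_TOK_INTEGER) {
        return false;
    }
    out->keyword = toks[0].kind;
    out->argument = toks[1].value;
    return true;
}
```

Swapping the two tokens makes the match fail, which is the sense in which the rule describes an ordered pair of nodes.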
49+
50+
.. note::
51+
The ``add1`` rule here is just an example and not part of the standard
52+
MicroPython grammar.
53+
54+
The fourth argument in this example is the token associated with the rule, ``KW_ADD1``. This token should be
55+
defined in the lexer by editing ``py/lexer.h``.
56+
57+
Defining the same rule without a compile function is achieved by using the ``DEF_RULE_NC`` macro
58+
and omitting the compile function argument:
59+
60+
.. code-block:: c
61+
62+
DEF_RULE_NC(add1_stmt, and(2), tok(KW_ADD1), rule(testlist))
63+
64+
The remaining arguments take on the same meaning. A rule without a compile function must
65+
be handled explicitly by all rules that may have this rule as a node. Such NC-rules are usually
66+
used to express sub-parts of a complicated grammar structure that cannot be expressed in a
67+
single rule.
68+
69+
.. note::
70+
The macros ``DEF_RULE`` and ``DEF_RULE_NC`` take other arguments. For an in-depth understanding of
71+
supported parameters, see `py/grammar.h <https://github.com/micropython/micropython/blob/master/py/grammar.h>`_.
72+
73+
Adding a lexical token
74+
----------------------
75+
76+
Every token referenced by a grammar rule, such as ``KW_ADD1``, must be defined in ``py/lexer.h``.
77+
Add this token by editing the ``_mp_token_kind_t`` enum:
78+
79+
.. code-block:: c
80+
:emphasize-lines: 11
81+
82+
typedef enum _mp_token_kind_t {
83+
...
84+
MP_TOKEN_KW_OR,
85+
MP_TOKEN_KW_PASS,
86+
MP_TOKEN_KW_RAISE,
87+
MP_TOKEN_KW_RETURN,
88+
MP_TOKEN_KW_TRY,
89+
MP_TOKEN_KW_WHILE,
90+
MP_TOKEN_KW_WITH,
91+
MP_TOKEN_KW_YIELD,
92+
MP_TOKEN_KW_ADD1,
93+
...
94+
} mp_token_kind_t;
95+
96+
Then also edit ``py/lexer.c`` to add the new keyword literal text:
97+
98+
.. code-block:: c
99+
:emphasize-lines: 11
100+
101+
STATIC const char *const tok_kw[] = {
102+
...
103+
"or",
104+
"pass",
105+
"raise",
106+
"return",
107+
"try",
108+
"while",
109+
"with",
110+
"yield",
111+
"add1",
112+
...
113+
};
114+
115+
You are free to choose the name of the keyword. For consistency, follow the existing
116+
naming convention for tokens and keyword strings.
117+
118+
.. note::
119+
The order of these keywords in ``py/lexer.c`` must match the order of tokens in the enum
120+
defined in ``py/lexer.h``.
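To see why the two orders have to agree, consider a minimal sketch in which a keyword's position in the string table is turned directly into a token kind. This is an assumption made for illustration (the real lookup in ``py/lexer.c`` may differ in detail), and all ``toy_`` names are hypothetical.

```c
#include <string.h>

// Hypothetical, shortened token enum. The keyword tokens are contiguous and
// listed in the same order as the strings in toy_tok_kw below.
typedef enum {
    TOY_TOKEN_NAME,     // Any identifier that is not a keyword.
    TOY_TOKEN_KW_WITH,  // First keyword token.
    TOY_TOKEN_KW_YIELD,
    TOY_TOKEN_KW_ADD1,
} toy_token_kind_t;

static const char *const toy_tok_kw[] = {
    "with",
    "yield",
    "add1",
};

// Classify an identifier: the index of the matching string is added to the
// first keyword token, so a mismatch in order yields the wrong token.
toy_token_kind_t toy_classify(const char *ident) {
    size_t n = sizeof(toy_tok_kw) / sizeof(toy_tok_kw[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(ident, toy_tok_kw[i]) == 0) {
            return (toy_token_kind_t)(TOY_TOKEN_KW_WITH + i);
        }
    }
    return TOY_TOKEN_NAME;
}
```

If ``"add1"`` were moved ahead of ``"yield"`` in the table without also moving the enum entry, ``toy_classify("add1")`` would return ``TOY_TOKEN_KW_YIELD``.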
121+
122+
Parsing
123+
-------
124+
125+
In the parsing stage the parser takes the tokens produced by the lexer and converts them to an abstract syntax tree (AST) or
126+
*parse tree*. The implementation for the parser is defined in `py/parse.c <https://github.com/micropython/micropython/blob/master/py/parse.c>`_.
127+
128+
The parser also maintains a table of constants for use in different aspects of parsing, similar to what a
129+
`symbol table <https://steemit.com/programming/@drifter1/writing-a-simple-compiler-on-my-own-symbol-table-basic-structure>`_
130+
does.
131+
132+
Several optimizations are performed during this phase: `constant folding <http://compileroptimizations.com/category/constant_folding.htm>`_
133+
on integers for most operations (logical, binary, unary, etc.), optimizations on parentheses
134+
around expressions, and some optimizations on strings.
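As a rough illustration of constant folding, the sketch below folds integer additions and multiplications in a tiny expression tree. It is not the code in ``py/parse.c``: the node layout and all ``toy_`` names are assumptions, and overflow handling is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_NODE_INT, TOY_NODE_NAME, TOY_NODE_ADD, TOY_NODE_MUL } toy_node_kind_t;

typedef struct toy_node {
    toy_node_kind_t kind;
    long value;           // Valid when kind == TOY_NODE_INT.
    struct toy_node *lhs; // Valid for TOY_NODE_ADD and TOY_NODE_MUL.
    struct toy_node *rhs;
} toy_node_t;

// Fold constant sub-expressions in place, bottom-up.
// Returns true if the node is (now) an integer constant.
bool toy_fold(toy_node_t *n) {
    if (n->kind == TOY_NODE_INT) {
        return true;
    }
    if (n->kind == TOY_NODE_NAME) {
        return false; // Value is only known at runtime.
    }
    // Fold both children first so partial folding still happens.
    bool lhs_const = toy_fold(n->lhs);
    bool rhs_const = toy_fold(n->rhs);
    if (!(lhs_const && rhs_const)) {
        return false;
    }
    long a = n->lhs->value;
    long b = n->rhs->value;
    n->value = (n->kind == TOY_NODE_ADD) ? a + b : a * b;
    n->kind = TOY_NODE_INT;
    n->lhs = NULL;
    n->rhs = NULL;
    return true;
}
```

For ``x + 2 * 3`` the root cannot be folded because ``x`` is a name, but the ``2 * 3`` subtree is replaced by the constant ``6``.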
135+
136+
It's worth noting that *docstrings* are discarded and not accessible to the compiler.
137+
Even optimizations like `string interning <https://en.wikipedia.org/wiki/String_interning>`_ are
138+
not applied to *docstrings*.
139+
140+
Compiler passes
141+
---------------
142+
143+
Like many compilers, MicroPython compiles all code to MicroPython bytecode or native code. The functionality
144+
that achieves this is implemented in `py/compile.c <https://github.com/micropython/micropython/blob/master/py/compile.c>`_.
145+
The most relevant function you should know about is this:
146+
147+
.. code-block:: c
148+
149+
mp_obj_t mp_compile(mp_parse_tree_t *parse_tree, qstr source_file, bool is_repl) {
150+
// Compile the input parse_tree to a raw-code structure.
151+
mp_raw_code_t *rc = mp_compile_to_raw_code(parse_tree, source_file, is_repl);
152+
// Create and return a function object that executes the outer module.
153+
return mp_make_function_from_raw_code(rc, MP_OBJ_NULL, MP_OBJ_NULL);
154+
}
155+
156+
The compiler compiles the code in four passes: scope, stack size, code size and emit.
157+
Each pass runs the same C code over the same AST data structure, with different things
158+
being computed each time based on the results of the previous pass.
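The idea of running the same code in several passes can be sketched with a toy emitter that has only two passes, code size and emit. This is a simplified model and not the MicroPython emitter: the opcodes, the single label and all ``toy_`` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_PASS_CODE_SIZE, TOY_PASS_EMIT } toy_pass_t;

typedef struct {
    toy_pass_t pass;
    size_t offset;       // Current position in the code.
    size_t label_offset; // Position of the single label, learned in the size pass.
    uint8_t *code;       // Output buffer, only written in the emit pass.
} toy_emit_t;

// Hypothetical opcodes.
enum { TOY_OP_JUMP = 1, TOY_OP_NOP = 2, TOY_OP_RETURN = 3 };

static void toy_emit_byte(toy_emit_t *e, uint8_t b) {
    if (e->pass == TOY_PASS_EMIT) {
        e->code[e->offset] = b;
    }
    e->offset++; // Every pass counts bytes the same way.
}

// The "program": jump forward over two NOPs to a label, then return.
// The same function runs unchanged in every pass.
static void toy_compile(toy_emit_t *e) {
    toy_emit_byte(e, TOY_OP_JUMP);
    // Forward reference: the value is only correct in the emit pass,
    // after the size pass has recorded where the label lands.
    toy_emit_byte(e, (uint8_t)e->label_offset);
    toy_emit_byte(e, TOY_OP_NOP);
    toy_emit_byte(e, TOY_OP_NOP);
    e->label_offset = e->offset; // Label definition.
    toy_emit_byte(e, TOY_OP_RETURN);
}

// Run both passes. Returns the code size, or 0 if buf is too small.
size_t toy_compile_all(uint8_t *buf, size_t cap) {
    toy_emit_t e = {TOY_PASS_CODE_SIZE, 0, 0, NULL};
    toy_compile(&e);
    size_t size = e.offset;
    if (size > cap) {
        return 0;
    }
    e.pass = TOY_PASS_EMIT;
    e.offset = 0;
    e.code = buf;
    toy_compile(&e);
    return size;
}
```

The jump operand is wrong during the size pass (the label has not been seen yet) and correct during the emit pass, which only works because the code size does not change between the two.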
159+
160+
First pass
161+
~~~~~~~~~~
162+
163+
In the first pass, the compiler learns about the known identifiers (variables) and
164+
their scope, being global, local, closed over, etc. In the same pass the emitter
165+
(bytecode or native code) also computes the number of labels needed for the emitted
166+
code.
167+
168+
.. code-block:: c
169+
170+
// Compile pass 1.
171+
comp->emit = emit_bc;
172+
comp->emit_method_table = &emit_bc_method_table;
173+
174+
uint max_num_labels = 0;
175+
for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) {
176+
if (s->emit_options == MP_EMIT_OPT_ASM) {
177+
compile_scope_inline_asm(comp, s, MP_PASS_SCOPE);
178+
} else {
179+
compile_scope(comp, s, MP_PASS_SCOPE);
180+
181+
// Check if any implicitly declared variables should be closed over.
182+
for (size_t i = 0; i < s->id_info_len; ++i) {
183+
id_info_t *id = &s->id_info[i];
184+
if (id->kind == ID_INFO_KIND_GLOBAL_IMPLICIT) {
185+
scope_check_to_close_over(s, id);
186+
}
187+
}
188+
}
189+
...
190+
}
191+
192+
Second and third passes
193+
~~~~~~~~~~~~~~~~~~~~~~~
194+
195+
The second and third passes involve computing the Python stack size and code size
196+
for the bytecode or native code. After the third pass the code size cannot change,
197+
otherwise jump labels will be incorrect.
198+
199+
.. code-block:: c
200+
201+
for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) {
202+
...
203+
204+
// Pass 2: Compute the Python stack size.
205+
compile_scope(comp, s, MP_PASS_STACK_SIZE);
206+
207+
// Pass 3: Compute the code size.
208+
if (comp->compile_error == MP_OBJ_NULL) {
209+
compile_scope(comp, s, MP_PASS_CODE_SIZE);
210+
}
211+
212+
...
213+
}
214+
215+
Just before pass two, the compiler selects the type of code to emit for the scope, which can
216+
be either native code or bytecode.
217+
218+
.. code-block:: c
219+
220+
// Choose the emitter type.
221+
switch (s->emit_options) {
222+
case MP_EMIT_OPT_NATIVE_PYTHON:
223+
case MP_EMIT_OPT_VIPER:
224+
if (emit_native == NULL) {
225+
emit_native = NATIVE_EMITTER(new)(&comp->compile_error, &comp->next_label, max_num_labels);
226+
}
227+
comp->emit_method_table = NATIVE_EMITTER_TABLE;
228+
comp->emit = emit_native;
229+
break;
230+
231+
default:
232+
comp->emit = emit_bc;
233+
comp->emit_method_table = &emit_bc_method_table;
234+
break;
235+
}
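The pattern behind this selection is a table of function pointers chosen once per scope. The sketch below is an assumed, much reduced model of that pattern and not the real emitter interface: the table has a single hypothetical operation, and the byte values are purely illustrative.

```c
#include <stddef.h>

typedef enum { TOY_EMIT_OPT_BYTECODE, TOY_EMIT_OPT_NATIVE } toy_emit_opt_t;

// A method table with one operation; a real emitter would have many more.
typedef struct {
    const char *name;
    size_t (*emit_return)(unsigned char *out);
} toy_emit_method_table_t;

static size_t toy_bc_emit_return(unsigned char *out) {
    out[0] = 0x01; // Made-up bytecode opcode.
    return 1;
}

static size_t toy_native_emit_return(unsigned char *out) {
    out[0] = 0xC3; // Illustrative machine-code byte (x86 "ret").
    return 1;
}

static const toy_emit_method_table_t toy_bc_table = {"bytecode", toy_bc_emit_return};
static const toy_emit_method_table_t toy_native_table = {"native", toy_native_emit_return};

// Choose the emitter; bytecode is the default, as in the switch above.
const toy_emit_method_table_t *toy_select_emitter(toy_emit_opt_t opt) {
    switch (opt) {
        case TOY_EMIT_OPT_NATIVE:
            return &toy_native_table;
        default:
            return &toy_bc_table;
    }
}
```

The compiler code that walks the tree only calls through the table, so it does not need to know which kind of code is being produced.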
236+
237+
Bytecode is the default. Note that the native code emitter has an
238+
additional variant, selected via ``VIPER``. See the
239+
:ref:`Emitting native code <emitting_native_code>` section for more details on
240+
viper annotations.
241+
242+
There is also support for *inline assembly code*, where assembly instructions are
243+
written as Python function calls but are emitted directly as the corresponding
244+
machine code. This assembler has only three passes (scope, code size, emit)
245+
and uses a different implementation, not the ``compile_scope`` function.
246+
See the `inline assembler tutorial <https://docs.micropython.org/en/latest/pyboard/tutorial/assembler.html#pyboard-tutorial-assembler>`_
247+
for more details.
248+
249+
Fourth pass
250+
~~~~~~~~~~~
251+
252+
The fourth pass emits the final code that can be executed, either bytecode in
253+
the virtual machine, or native code directly by the CPU.
254+
255+
.. code-block:: c
256+
257+
for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) {
258+
...
259+
260+
// Pass 4: Emit the compiled bytecode or native code.
261+
if (comp->compile_error == MP_OBJ_NULL) {
262+
compile_scope(comp, s, MP_PASS_EMIT);
263+
}
264+
}
265+
266+
Emitting bytecode
267+
-----------------
268+
269+
Statements in Python code usually correspond to emitted bytecode, for example ``a + b``
270+
generates "push a" then "push b" then "binary op add". Some statements do not emit
271+
anything but instead affect other things like the scope of variables, for example
272+
``global a``.
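The "push a, push b, binary op add" sequence can be made concrete with a toy stack machine. The opcodes and encoding here are invented for this sketch and are unrelated to the real MicroPython bytecode format; bounds checks are omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical opcodes for this sketch.
enum { TOY_BC_LOAD = 1, TOY_BC_BINARY_ADD = 2 };

// Emit code for "vars[a] + vars[b]". Returns the number of bytes written.
size_t toy_emit_add(uint8_t *code, uint8_t a, uint8_t b) {
    size_t n = 0;
    code[n++] = TOY_BC_LOAD;       // push a
    code[n++] = a;
    code[n++] = TOY_BC_LOAD;       // push b
    code[n++] = b;
    code[n++] = TOY_BC_BINARY_ADD; // pop two values, push their sum
    return n;
}

// Minimal interpreter loop for the two opcodes above.
long toy_run(const uint8_t *code, size_t len, const long *vars) {
    long stack[8];
    size_t sp = 0;
    size_t ip = 0;
    while (ip < len) {
        switch (code[ip++]) {
            case TOY_BC_LOAD:
                stack[sp++] = vars[code[ip++]];
                break;
            case TOY_BC_BINARY_ADD: {
                long rhs = stack[--sp];
                long lhs = stack[--sp];
                stack[sp++] = lhs + rhs;
                break;
            }
        }
    }
    return stack[sp - 1]; // Result is left on top of the stack.
}
```

The expression needs a stack depth of two, which is the kind of information the stack size pass computes before any code is written.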
273+
274+
The implementation of a function that emits bytecode looks similar to this:
275+
276+
.. code-block:: c
277+
278+
void mp_emit_bc_unary_op(emit_t *emit, mp_unary_op_t op) {
279+
emit_write_bytecode_byte(emit, 0, MP_BC_UNARY_OP_MULTI + op);
280+
}
281+
282+
We use the unary operator expressions as an example here, but the implementation
283+
details are similar for other statements/expressions. The function ``emit_write_bytecode_byte()``
284+
is a wrapper around the main function ``emit_get_cur_to_write_bytecode()`` that all
285+
functions must call to emit bytecode.
286+
287+
.. _emitting_native_code:
288+
289+
Emitting native code
290+
---------------------
291+
292+
Similar to how bytecode is generated, there should be a corresponding function in ``py/emitnative.c`` for each
293+
code statement:
294+
295+
.. code-block:: c
296+
297+
STATIC void emit_native_unary_op(emit_t *emit, mp_unary_op_t op) {
298+
vtype_kind_t vtype;
299+
emit_pre_pop_reg(emit, &vtype, REG_ARG_2);
300+
if (vtype == VTYPE_PYOBJ) {
301+
emit_call_with_imm_arg(emit, MP_F_UNARY_OP, op, REG_ARG_1);
302+
emit_post_push_reg(emit, VTYPE_PYOBJ, REG_RET);
303+
} else {
304+
adjust_stack(emit, 1);
305+
EMIT_NATIVE_VIPER_TYPE_ERROR(emit,
306+
MP_ERROR_TEXT("unary op %q not implemented"), mp_unary_op_method_name[op]);
307+
}
308+
}
309+
310+
The difference here is that we have to handle *viper typing*. Viper annotations allow
311+
us to handle more than one type of variable. By default all variables are Python objects,
312+
but with viper a variable can also be declared as a machine-typed variable like a native
313+
integer or pointer. Viper can be thought of as a superset of Python, where normal Python
314+
objects are handled as usual, while native machine variables are handled in an optimised
315+
way by using direct machine instructions for the operations. Viper typing may break
316+
Python equivalence because, for example, integers become native integers and can overflow
317+
(unlike Python integers which extend automatically to arbitrary precision).
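The overflow behaviour can be illustrated in C, assuming for the sake of the example a 32-bit machine word (the actual width depends on the target). The function name is hypothetical.

```c
#include <stdint.h>

// Add two values the way a 32-bit machine register would:
// the result wraps around modulo 2**32 instead of growing.
uint32_t toy_native_add32(uint32_t a, uint32_t b) {
    return a + b; // The result converted to uint32_t is reduced modulo 2**32.
}
```

Here ``toy_native_add32(0xFFFFFFFF, 1)`` yields ``0``, whereas the same addition on Python integers gives ``4294967296``.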
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
.. _extendingmicropython:
2+
3+
Extending MicroPython in C
4+
==========================
5+
6+
This chapter describes options for implementing additional functionality in C, but from code
7+
written outside of the main MicroPython repository. The first approach is useful for building
8+
your own custom firmware with some project-specific additional modules or functions that can
9+
be accessed from Python. The second approach is for building modules that can be loaded at runtime.
10+
11+
Please see the :ref:`library section <internals_library>` for more information on building core modules that
12+
live in the main MicroPython repository.
13+
14+
.. toctree::
15+
:maxdepth: 3
16+
17+
cmodules.rst
18+
natmod.rst
19+