Commit c9d69ab: Part 12 article (1 parent: 8e1773f)

1 file changed: `_parts/part12.md` (289 additions, 0 deletions)

---
title: Part 12 - Scanning a Multi-Level B-Tree
date: 2017-11-11
---

We now support constructing a multi-level btree, but we've broken `select` statements in the process. Here's a test case that inserts 15 rows and then tries to print them.

```diff
+  it 'prints all rows in a multi-level tree' do
+    script = []
+    (1..15).each do |i|
+      script << "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "select"
+    script << ".exit"
+    result = run_script(script)
+
+    expect(result[15...result.length]).to eq([
+      "db > (1, user1, person1@example.com)",
+      "(2, user2, person2@example.com)",
+      "(3, user3, person3@example.com)",
+      "(4, user4, person4@example.com)",
+      "(5, user5, person5@example.com)",
+      "(6, user6, person6@example.com)",
+      "(7, user7, person7@example.com)",
+      "(8, user8, person8@example.com)",
+      "(9, user9, person9@example.com)",
+      "(10, user10, person10@example.com)",
+      "(11, user11, person11@example.com)",
+      "(12, user12, person12@example.com)",
+      "(13, user13, person13@example.com)",
+      "(14, user14, person14@example.com)",
+      "(15, user15, person15@example.com)",
+      "Executed.", "db > ",
+    ])
+  end
```

But when we run that test case right now, what actually happens is:

```
db > select
(2, user1, person1@example.com)
Executed.
```

That's weird. It's only printing one row, and that row looks corrupted (notice the id doesn't match the username).

The weirdness is because `execute_select()` begins at the start of the table, and our current implementation of `table_start()` returns cell 0 of the root node. But the root of our tree is now an internal node, which doesn't contain any rows. The data that was printed must have been left over from when the root node was a leaf. `table_start()` should really return cell 0 of the leftmost leaf node.

So get rid of the old implementation:

```diff
-Cursor* table_start(Table* table) {
-  Cursor* cursor = malloc(sizeof(Cursor));
-  cursor->table = table;
-  cursor->page_num = table->root_page_num;
-  cursor->cell_num = 0;
-
-  void* root_node = get_page(table->pager, table->root_page_num);
-  uint32_t num_cells = *leaf_node_num_cells(root_node);
-  cursor->end_of_table = (num_cells == 0);
-
-  return cursor;
-}
```

And add a new implementation that searches for key 0 (the minimum possible key). Even if key 0 does not exist in the table, this method will return the position of the lowest id (the start of the left-most leaf node).

```diff
+Cursor* table_start(Table* table) {
+  Cursor* cursor = table_find(table, 0);
+
+  void* node = get_page(table->pager, cursor->page_num);
+  uint32_t num_cells = *leaf_node_num_cells(node);
+  cursor->end_of_table = (num_cells == 0);
+
+  return cursor;
+}
```
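
To see why searching for key 0 always lands on the leftmost leaf, here is a small in-memory sketch. It does not use the tutorial's page layout: `ToyNode` and `toy_find_leaf` are hypothetical names, and the node stores child pointers directly instead of page numbers. The only idea it borrows is that an internal node keeps the maximum key of each left child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_KEYS 4

/* Hypothetical in-memory node; the tutorial stores nodes in raw pages. */
typedef struct ToyNode {
  bool is_leaf;
  uint32_t num_keys;
  /* Leaf: row ids. Internal: maximum key held by each left child. */
  uint32_t keys[TOY_MAX_KEYS];
  /* Internal only: children[num_keys] is the rightmost child. */
  struct ToyNode* children[TOY_MAX_KEYS + 1];
} ToyNode;

/* Descend from `node` to the leaf that would contain `key`. */
const ToyNode* toy_find_leaf(const ToyNode* node, uint32_t key) {
  while (!node->is_leaf) {
    uint32_t i = 0;
    /* Pick the first child whose maximum key is >= key. */
    while (i < node->num_keys && node->keys[i] < key) {
      i++;
    }
    assert(node->children[i] != NULL);
    node = node->children[i];
  }
  return node;
}
```

Because no unsigned key is smaller than 0, the inner loop never advances when we search for 0, so the descent takes child 0 at every level and ends on the leftmost leaf.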

With those changes, it still only prints out one node's worth of rows:

```
db > select
(1, user1, person1@example.com)
(2, user2, person2@example.com)
(3, user3, person3@example.com)
(4, user4, person4@example.com)
(5, user5, person5@example.com)
(6, user6, person6@example.com)
(7, user7, person7@example.com)
Executed.
db >
```

With 15 entries, our btree consists of one internal node and two leaf nodes, which looks something like this:

{% include image.html url="assets/images/btree3.png" description="structure of our btree" %}

To scan the entire table, we need to jump to the second leaf node after we reach the end of the first. To do that, we're going to save a new field in the leaf node header called "next_leaf", which will hold the page number of the leaf's sibling node on the right. The rightmost leaf node will have a `next_leaf` value of 0 to denote no sibling (page 0 is reserved for the root node of the table anyway).

Update the leaf node header format to include the new field:

```diff
 const uint32_t LEAF_NODE_NUM_CELLS_SIZE = sizeof(uint32_t);
 const uint32_t LEAF_NODE_NUM_CELLS_OFFSET = COMMON_NODE_HEADER_SIZE;
-const uint32_t LEAF_NODE_HEADER_SIZE =
-    COMMON_NODE_HEADER_SIZE + LEAF_NODE_NUM_CELLS_SIZE;
+const uint32_t LEAF_NODE_NEXT_LEAF_SIZE = sizeof(uint32_t);
+const uint32_t LEAF_NODE_NEXT_LEAF_OFFSET =
+    LEAF_NODE_NUM_CELLS_OFFSET + LEAF_NODE_NUM_CELLS_SIZE;
+const uint32_t LEAF_NODE_HEADER_SIZE = COMMON_NODE_HEADER_SIZE +
+                                       LEAF_NODE_NUM_CELLS_SIZE +
+                                       LEAF_NODE_NEXT_LEAF_SIZE;
```

Add a method to access the new field:
```diff
+uint32_t* leaf_node_next_leaf(void* node) {
+  return node + LEAF_NODE_NEXT_LEAF_OFFSET;
+}
```
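
If you want to check the header layout in isolation, the following sketch lays the same fields out in a plain byte buffer. The `toy_` names are made up for this example, the 4096-byte page size is an assumption, and the common header is taken to be 6 bytes, which is what `.constants` reports at this point. It reads and writes fields with `memcpy` rather than casting pointers, which is a substitution on my part and not what the tutorial code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
  TOY_PAGE_SIZE = 4096,        /* assumed page size */
  TOY_COMMON_HEADER_SIZE = 6,  /* COMMON_NODE_HEADER_SIZE at this point */
  TOY_NUM_CELLS_OFFSET = TOY_COMMON_HEADER_SIZE,
  TOY_NEXT_LEAF_OFFSET = TOY_NUM_CELLS_OFFSET + 4, /* after the 4-byte num_cells */
  TOY_LEAF_HEADER_SIZE = TOY_NEXT_LEAF_OFFSET + 4  /* after the 4-byte next_leaf */
};

/* Read a 4-byte field at `offset` within a page. */
uint32_t toy_read_u32(const unsigned char* page, size_t offset) {
  uint32_t value;
  assert(offset + sizeof value <= (size_t)TOY_PAGE_SIZE);
  memcpy(&value, page + offset, sizeof value);
  return value;
}

/* Write a 4-byte field at `offset` within a page. */
void toy_write_u32(unsigned char* page, size_t offset, uint32_t value) {
  assert(offset + sizeof value <= (size_t)TOY_PAGE_SIZE);
  memcpy(page + offset, &value, sizeof value);
}
```

The `memcpy` version sidesteps two portability wrinkles: arithmetic on `void*` is a compiler extension rather than standard C, and offsets 6 and 10 are not 4-byte aligned, so dereferencing a `uint32_t*` there is not guaranteed to work on every platform.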

Set `next_leaf` to 0 by default when initializing a new leaf node:

```diff
@@ -322,6 +330,7 @@ void initialize_leaf_node(void* node) {
   set_node_type(node, NODE_LEAF);
   set_node_root(node, false);
   *leaf_node_num_cells(node) = 0;
+  *leaf_node_next_leaf(node) = 0;  // 0 represents no sibling
 }
```

Whenever we split a leaf node, update the sibling pointers. The old leaf's sibling becomes the new leaf, and the new leaf's sibling becomes whatever used to be the old leaf's sibling.

```diff
@@ -659,6 +671,8 @@ void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) {
   uint32_t new_page_num = get_unused_page_num(cursor->table->pager);
   void* new_node = get_page(cursor->table->pager, new_page_num);
   initialize_leaf_node(new_node);
+  *leaf_node_next_leaf(new_node) = *leaf_node_next_leaf(old_node);
+  *leaf_node_next_leaf(old_node) = new_page_num;
```
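
The pointer update is the usual "insert after" step for a singly linked list. Here is a stripped-down sketch where the sibling pointers live in a plain array indexed by page number; `toy_link_new_leaf` and the array are hypothetical stand-ins for the real page headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * next_leaf[p] holds the page number of leaf p's right sibling.
 * 0 means "no sibling", as in the article.
 */
void toy_link_new_leaf(uint32_t next_leaf[], uint32_t old_page,
                       uint32_t new_page) {
  assert(new_page != 0 && new_page != old_page);
  /* The new leaf inherits whatever used to follow the old leaf... */
  next_leaf[new_page] = next_leaf[old_page];
  /* ...and only then does the old leaf point at the new one. */
  next_leaf[old_page] = new_page;
}
```

The order of the two assignments matters. If they were swapped, the old leaf's original sibling would be overwritten before it was copied, and the new leaf would end up pointing at itself.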

Adding a new field changes a few constants:
```diff
   it 'prints constants' do
     script = [
       ".constants",
@@ -199,9 +228,9 @@ describe 'database' do
       "db > Constants:",
       "ROW_SIZE: 293",
       "COMMON_NODE_HEADER_SIZE: 6",
-      "LEAF_NODE_HEADER_SIZE: 10",
+      "LEAF_NODE_HEADER_SIZE: 14",
       "LEAF_NODE_CELL_SIZE: 297",
-      "LEAF_NODE_SPACE_FOR_CELLS: 4086",
+      "LEAF_NODE_SPACE_FOR_CELLS: 4082",
       "LEAF_NODE_MAX_CELLS: 13",
       "db > ",
     ])
```
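
The numbers in that test can be reproduced with a little arithmetic. This sketch assumes a 4096-byte page (which is what 4082 + 14 implies) and takes the cell size straight from the test output; the `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
  TOY_PAGE_SIZE = 4096, /* assumed: 4082 + 14 from the test output */
  TOY_CELL_SIZE = 297   /* LEAF_NODE_CELL_SIZE from the test output */
};

/* Bytes left for cells once the leaf header is accounted for. */
uint32_t toy_space_for_cells(uint32_t leaf_header_size) {
  assert(leaf_header_size <= (uint32_t)TOY_PAGE_SIZE);
  return (uint32_t)TOY_PAGE_SIZE - leaf_header_size;
}

/* Whole cells that fit in that space (integer division rounds down). */
uint32_t toy_max_cells(uint32_t leaf_header_size) {
  return toy_space_for_cells(leaf_header_size) / (uint32_t)TOY_CELL_SIZE;
}
```

With a 14-byte header this gives 4082 bytes of cell space, and with the old 10-byte header it gives 4086. Both hold 13 cells, since 13 × 297 = 3861 fits and 14 × 297 = 4158 does not, which is why `LEAF_NODE_MAX_CELLS` is unchanged.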

Now whenever we want to advance the cursor past the end of a leaf node, we can check if the leaf node has a sibling. If it does, jump to it. Otherwise, we're at the end of the table.

```diff
@@ -428,7 +432,15 @@ void cursor_advance(Cursor* cursor) {
 
   cursor->cell_num += 1;
   if (cursor->cell_num >= (*leaf_node_num_cells(node))) {
-    cursor->end_of_table = true;
+    /* Advance to next leaf node */
+    uint32_t next_page_num = *leaf_node_next_leaf(node);
+    if (next_page_num == 0) {
+      /* This was rightmost leaf */
+      cursor->end_of_table = true;
+    } else {
+      cursor->page_num = next_page_num;
+      cursor->cell_num = 0;
+    }
   }
 }
```
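
To convince ourselves that this walk visits every row exactly once, here is a toy version of the same loop. `ToyLeaf`, `ToyCursor`, and `toy_count_rows` are hypothetical; each leaf is reduced to a cell count and a sibling page number, and the "pager" is just an array indexed by page number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint32_t num_cells;
  uint32_t next_leaf; /* 0 means no right sibling */
} ToyLeaf;

typedef struct {
  const ToyLeaf* pages; /* indexed by page number */
  uint32_t page_num;
  uint32_t cell_num;
  bool end_of_table;
} ToyCursor;

/* Same shape as cursor_advance(): step one cell, hop to the sibling if needed. */
void toy_cursor_advance(ToyCursor* cursor) {
  assert(!cursor->end_of_table);
  const ToyLeaf* leaf = &cursor->pages[cursor->page_num];
  cursor->cell_num += 1;
  if (cursor->cell_num >= leaf->num_cells) {
    if (leaf->next_leaf == 0) {
      cursor->end_of_table = true; /* this was the rightmost leaf */
    } else {
      cursor->page_num = leaf->next_leaf;
      cursor->cell_num = 0;
    }
  }
}

/* Count the rows a full scan would visit, starting at `first_leaf`. */
uint32_t toy_count_rows(const ToyLeaf* pages, uint32_t first_leaf) {
  ToyCursor cursor = {pages, first_leaf, 0, pages[first_leaf].num_cells == 0};
  uint32_t count = 0;
  while (!cursor.end_of_table) {
    count++;
    toy_cursor_advance(&cursor);
  }
  return count;
}
```

With a 7-cell leaf whose sibling holds 8 cells, the scan counts 15 rows. Note that the sketch, like the diff above, assumes a sibling leaf is never empty; an empty sibling would be visited as if it had one cell.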

After those changes, we actually print 15 rows...
```
db > select
(1, user1, person1@example.com)
(2, user2, person2@example.com)
(3, user3, person3@example.com)
(4, user4, person4@example.com)
(5, user5, person5@example.com)
(6, user6, person6@example.com)
(7, user7, person7@example.com)
(8, user8, person8@example.com)
(9, user9, person9@example.com)
(10, user10, person10@example.com)
(11, user11, person11@example.com)
(12, user12, person12@example.com)
(13, user13, person13@example.com)
(1919251317, 14, on14@example.com)
(15, user15, person15@example.com)
Executed.
db >
```

...but one of them looks corrupted
```
(1919251317, 14, on14@example.com)
```

After some debugging, I found out it's because of a bug in how we split leaf nodes:

```diff
@@ -676,7 +690,9 @@ void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) {
     void* destination = leaf_node_cell(destination_node, index_within_node);
 
     if (i == cursor->cell_num) {
-      serialize_row(value, destination);
+      serialize_row(value,
+                    leaf_node_value(destination_node, index_within_node));
+      *leaf_node_key(destination_node, index_within_node) = key;
     } else if (i > cursor->cell_num) {
       memcpy(destination, leaf_node_cell(old_node, i - 1), LEAF_NODE_CELL_SIZE);
     } else {
```

Remember that each cell in a leaf node consists of first a key then a value:

{% include image.html url="assets/images/leaf-node-format.png" description="Original leaf node format" %}

We were writing the new row (value) into the start of the cell, where the key should go. That means part of the username was going into the section for id (hence the crazy large id).
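
The "crazy large id" is not random: 1919251317 is 0x72657375, which is the four bytes `u`, `s`, `e`, `r` read as a little-endian integer. The sketch below reproduces the effect in a standalone buffer. It assumes a row is serialized as a 4-byte id, a 33-byte username, and a 256-byte email (sizes that add up to the 293-byte `ROW_SIZE`, but are an assumption here), and all `toy_` names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
  TOY_KEY_SIZE = 4,
  TOY_ID_SIZE = 4,        /* assumed field sizes; they sum to ROW_SIZE 293 */
  TOY_USERNAME_SIZE = 33,
  TOY_EMAIL_SIZE = 256,
  TOY_ROW_SIZE = TOY_ID_SIZE + TOY_USERNAME_SIZE + TOY_EMAIL_SIZE,
  TOY_CELL_SIZE = TOY_KEY_SIZE + TOY_ROW_SIZE /* 297, as in .constants */
};

/* Serialize a row as id, username, email at fixed offsets from `dest`. */
void toy_serialize_row(uint32_t id, const char* username, const char* email,
                       unsigned char* dest) {
  assert(strlen(username) < (size_t)TOY_USERNAME_SIZE);
  assert(strlen(email) < (size_t)TOY_EMAIL_SIZE);
  memset(dest, 0, TOY_ROW_SIZE);
  memcpy(dest, &id, TOY_ID_SIZE);
  memcpy(dest + TOY_ID_SIZE, username, strlen(username));
  memcpy(dest + TOY_ID_SIZE + TOY_USERNAME_SIZE, email, strlen(email));
}

/* Read the row id from where the value is supposed to live: after the key. */
uint32_t toy_cell_row_id(const unsigned char* cell) {
  uint32_t id;
  memcpy(&id, cell + TOY_KEY_SIZE, sizeof id);
  return id;
}

/* Username as seen by a reader that expects the value after the key. */
const char* toy_cell_username(const unsigned char* cell) {
  return (const char*)cell + TOY_KEY_SIZE + TOY_ID_SIZE;
}
```

Calling `toy_serialize_row(14, "user14", "person14@example.com", cell)` mimics the bug: everything lands 4 bytes too early, so the reader sees the bytes of "user" as the id and "14" as the username. Passing `cell + TOY_KEY_SIZE` as the destination, and writing the key into the first 4 bytes separately, mirrors the fix.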

After fixing that bug, we finally print out the entire table as expected:

```
db > select
(1, user1, person1@example.com)
(2, user2, person2@example.com)
(3, user3, person3@example.com)
(4, user4, person4@example.com)
(5, user5, person5@example.com)
(6, user6, person6@example.com)
(7, user7, person7@example.com)
(8, user8, person8@example.com)
(9, user9, person9@example.com)
(10, user10, person10@example.com)
(11, user11, person11@example.com)
(12, user12, person12@example.com)
(13, user13, person13@example.com)
(14, user14, person14@example.com)
(15, user15, person15@example.com)
Executed.
db >
```

Whew! One bug after another, but we're making progress. Now that we've got the sibling pointer, I don't think we actually need a parent pointer. I added it preemptively, but we never actually used it.

```diff
 const uint32_t NODE_TYPE_OFFSET = 0;
 const uint32_t IS_ROOT_SIZE = sizeof(uint8_t);
 const uint32_t IS_ROOT_OFFSET = NODE_TYPE_SIZE;
-const uint32_t PARENT_POINTER_SIZE = sizeof(uint32_t);
-const uint32_t PARENT_POINTER_OFFSET = IS_ROOT_OFFSET + IS_ROOT_SIZE;
 const uint8_t COMMON_NODE_HEADER_SIZE =
-    NODE_TYPE_SIZE + IS_ROOT_SIZE + PARENT_POINTER_SIZE;
+    NODE_TYPE_SIZE + IS_ROOT_SIZE;
```

```diff
     expect(result).to eq([
       "db > Constants:",
       "ROW_SIZE: 293",
-      "COMMON_NODE_HEADER_SIZE: 6",
-      "LEAF_NODE_HEADER_SIZE: 14",
+      "COMMON_NODE_HEADER_SIZE: 2",
+      "LEAF_NODE_HEADER_SIZE: 10",
       "LEAF_NODE_CELL_SIZE: 297",
-      "LEAF_NODE_SPACE_FOR_CELLS: 4082",
+      "LEAF_NODE_SPACE_FOR_CELLS: 4086",
       "LEAF_NODE_MAX_CELLS: 13",
       "db > ",
     ])
```

Until next time.
