
Commit 7fd4f29

Evictions made robust (#128)
1 parent df6900f commit 7fd4f29

9 files changed

Lines changed: 753 additions & 298 deletions

File tree

docs/memory_management.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# Memory Management

## Overview

Endee does not keep every index resident all the time. `IndexManager` keeps a bounded set of
live indices in memory in `indices_`, and uses `indices_list_` to choose eviction candidates when
that live set is full.

When an index is created or loaded on demand, `ensureLiveIndexCapacity()` may call
`evictIfNeeded()` before admitting the new live index.

In the current implementation, a live index consists of:

- the in-memory `HierarchicalNSW<float>` graph
- `IDMapper`
- `VectorStorage`, which itself owns separate MDBX-backed stores for vectors, metadata, and filters
- optional `SparseVectorStorage`
- the per-index WAL object

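As a rough mental model, the sketch below shows how these pieces might hang together around
`indices_`, `indices_list_`, `ensureLiveIndexCapacity()`, and `evictIfNeeded()`. Only the names
already quoted above come from the code; `LiveIndexEntry`, `WriteAheadLog`, the `dirty` flag, and
the helper declarations are illustrative stand-ins, not the actual declarations.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Stubs standing in for the real project types (sketch only).
namespace hnswlib { template <typename T> struct HierarchicalNSW {}; }
struct IDMapper {};
struct VectorStorage {};        // owns the MDBX-backed vector/metadata/filter stores
struct SparseVectorStorage {};
struct WriteAheadLog {};        // hypothetical name for the per-index WAL object

struct LiveIndexEntry {
    std::unique_ptr<hnswlib::HierarchicalNSW<float>> graph;
    std::unique_ptr<IDMapper> id_mapper;
    std::unique_ptr<VectorStorage> vectors;
    std::unique_ptr<SparseVectorStorage> sparse;    // optional
    std::unique_ptr<WriteAheadLog> wal;
    bool dirty = false;
};

class IndexManager {
    static constexpr std::size_t MAX_LIVE_INDICES = 255;

    std::unordered_map<std::string, LiveIndexEntry> indices_;  // bounded live set
    std::list<std::string> indices_list_;                      // eviction candidate order

    void ensureLiveIndexCapacity() {
        // Called before admitting a newly created or loaded index; the
        // live-index-count guard itself lives in evictIfNeeded().
        evictIfNeeded();
    }

    void evictIfNeeded();                       // see "How Eviction Works Today" below
    void saveIndex(LiveIndexEntry& entry);      // hypothetical: persist a dirty index
    void markInvalid(const std::string& name);  // hypothetical: invalidate a cache entry
};
```
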
## What Actually Uses DRAM

The dominant in-memory cost of a live dense index is the HNSW structure:

- the base layer, allocated as `maxElements * sizeDataAtBaseLayer_`
- upper-layer node storage in `dataUpperLayer_`
- the vector cache, sized from `VECTOR_CACHE_PERCENTAGE` and `VECTOR_CACHE_MIN_BITS`
- the visited-list pool and other small bookkeeping structures

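A back-of-the-envelope estimate of that footprint; only `maxElements`, `sizeDataAtBaseLayer_`,
`VECTOR_CACHE_PERCENTAGE`, and `VECTOR_CACHE_MIN_BITS` are named in the code, and the upper-layer
and cache terms below are assumed placeholders rather than the real sizing math:

```cpp
#include <algorithm>
#include <cstddef>

// Rough estimate only. The base-layer term matches the allocation described
// above; the upper-layer average and the cache formula are assumptions.
std::size_t estimateHnswDramBytes(std::size_t maxElements,
                                  std::size_t sizeDataAtBaseLayer,  // bytes per node, base layer
                                  std::size_t avgUpperLayerBytes,   // hypothetical per-node average
                                  double vectorCachePercent,        // cf. VECTOR_CACHE_PERCENTAGE
                                  unsigned vectorCacheMinBits) {    // cf. VECTOR_CACHE_MIN_BITS
    const std::size_t baseLayer   = maxElements * sizeDataAtBaseLayer;
    const std::size_t upperLayers = maxElements * avgUpperLayerBytes;
    // Assumed cache sizing: a percentage of the base layer, floored at 2^minBits bytes.
    const std::size_t cacheFloor  = std::size_t{1} << vectorCacheMinBits;
    const std::size_t cache       = std::max<std::size_t>(
        cacheFloor, static_cast<std::size_t>(baseLayer * (vectorCachePercent / 100.0)));
    return baseLayer + upperLayers + cache;  // visited-list pool etc. not counted
}
```
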
One important detail: Endee does not load the full dense vector corpus into the HNSW object.
Dense vectors stay in `VectorStorage` and are fetched on demand through the vector fetcher and the
vector cache. So the main DRAM cost is the graph plus cache, not a second full copy of the vector
database.

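A minimal sketch of that fetch path, using stand-in types; the real fetcher and cache interfaces
are not shown in this document, so every name below other than `VectorStorage` is hypothetical:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Stand-ins for the real interfaces; only the flow (cache first, MDBX-backed
// storage on a miss) comes from the text above.
struct VectorStorage {
    std::vector<float> readVector(std::uint64_t id) {
        return std::vector<float>(128, 0.0f);  // stub; the real store reads from MDBX
    }
};

struct VectorCache {
    std::unordered_map<std::uint64_t, std::vector<float>> entries;
    std::optional<std::vector<float>> get(std::uint64_t id) {
        auto it = entries.find(id);
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
    void put(std::uint64_t id, std::vector<float> v) { entries[id] = std::move(v); }
};

// Distance evaluations ask for a vector by id; only misses touch VectorStorage.
std::vector<float> fetchVector(std::uint64_t id, VectorCache& cache, VectorStorage& storage) {
    if (auto hit = cache.get(id)) return *hit;
    std::vector<float> v = storage.readVector(id);
    cache.put(id, v);
    return v;
}
```
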
## Scaling

### 1. Virtual Address Space

Each live dense index opens multiple MDBX environments, each with a large configured upper bound:

| Component | Max map size |
| --- | --- |
| `IDMapper` | 8 GiB |
| dense vector store | 4 TiB |
| dense metadata store | 512 GiB |
| filter store | 64 GiB |
| sparse storage | 1 TiB |

These are the default configured maxima. In `settings.hpp`, both the initial/current map size and
the maximum map size are runtime-configurable through environment variables such as:
`NDD_INDEX_META_MAP_SIZE_BITS`, `NDD_INDEX_META_MAP_SIZE_MAX_BITS`,
`NDD_ID_MAPPER_MAP_SIZE_BITS`, `NDD_ID_MAPPER_MAP_SIZE_MAX_BITS`,
`NDD_FILTER_MAP_SIZE_BITS`, `NDD_FILTER_MAP_SIZE_MAX_BITS`,
`NDD_METADATA_MAP_SIZE_BITS`, `NDD_METADATA_MAP_SIZE_MAX_BITS`,
`NDD_VECTOR_MAP_SIZE_BITS`, `NDD_VECTOR_MAP_SIZE_MAX_BITS`, and
`NDD_SPARSE_MAP_SIZE_MAX_BITS`.

That is about 5.57 TiB of configured MDBX map capacity for a live index with sparse storage
enabled. For a dense-only index, the total is about 4.57 TiB. There is also one global metadata
environment for index metadata with a 128 MiB upper bound.

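The arithmetic behind those totals, using the default maxima from the table above:

```cpp
#include <cstdint>

// Default per-index map maxima from the table above, expressed in bytes.
constexpr std::uint64_t GiB = 1024ull * 1024 * 1024;
constexpr std::uint64_t TiB = 1024ull * GiB;

constexpr std::uint64_t idMapper = 8 * GiB;
constexpr std::uint64_t vectors  = 4 * TiB;
constexpr std::uint64_t metadata = 512 * GiB;
constexpr std::uint64_t filters  = 64 * GiB;
constexpr std::uint64_t sparse   = 1 * TiB;

constexpr std::uint64_t denseOnly  = idMapper + vectors + metadata + filters;  // 4680 GiB ~ 4.57 TiB
constexpr std::uint64_t withSparse = denseOnly + sparse;                       // 5704 GiB ~ 5.57 TiB

static_assert(denseOnly == 4680 * GiB);
static_assert(withSparse == 5704 * GiB);
```
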
Because these environments stay open while the index is live, virtual address space becomes a
scaling constraint, especially when many indices are resident at once.

### 2. Server DRAM

Each live index allocates its HNSW graph structures eagerly when the index is created or loaded.
That memory must fit in RAM for the server to stay healthy.

### 3. Sticky-Thread MDBX Environments and `PTHREAD_KEYS_MAX`

All dense-index MDBX environments are currently opened without `MDBX_NOSTICKYTHREADS`. The only
place that enables `MDBX_NOSTICKYTHREADS` today is sparse storage.

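For reference, a minimal sketch of where that flag is passed when an environment is opened,
assuming the plain libmdbx C API; this is not the project's actual storage code, and the path,
geometry, and error handling are placeholders:

```cpp
#include <cstdint>
#include <mdbx.h>

// Sketch only: shows where MDBX_NOSTICKYTHREADS is passed at open time.
// Today only sparse storage opens its environment this way; the dense-index
// stores omit the flag and therefore use sticky-thread reader slots.
MDBX_env* openSparseEnv(const char* path) {
    MDBX_env* env = nullptr;
    if (mdbx_env_create(&env) != MDBX_SUCCESS) return nullptr;

    // lower, now, upper, growth, shrink, pagesize (-1 keeps the default);
    // the 1 TiB upper bound mirrors the sparse-storage maximum above.
    mdbx_env_set_geometry(env, -1, -1, static_cast<intptr_t>(1ull << 40), -1, -1, -1);

    if (mdbx_env_open(env, path, MDBX_NOSTICKYTHREADS, 0664) != MDBX_SUCCESS) {
        mdbx_env_close(env);
        return nullptr;
    }
    return env;
}
```
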
That means a live dense index currently opens four sticky-thread MDBX environments:

- `IDMapper`
- dense vector store
- dense metadata store
- filter store

There is also one global sticky-thread environment in `MetadataManager`.

If libmdbx consumes one pthread TLS key per sticky environment, the current constant
`MAX_LIVE_INDICES = 255` is consistent with the code layout:

- `255 * 4 = 1020` per-index sticky environments
- `+1` global metadata environment
- total `1021`, which stays just below a `PTHREAD_KEYS_MAX` of `1024`

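A sketch of how a runtime-derived cap could be computed from the same arithmetic; the `4 + 1`
environment counts come from the lists above, while the helper name, glibc fallback, and safety
margin are illustrative assumptions:

```cpp
#include <unistd.h>
#include <cstddef>

// Derive a live-index cap from the TLS-key budget, assuming one pthread key
// per sticky-thread environment. The margin for keys used by other libraries
// is a made-up example value.
std::size_t maxLiveIndicesForTlsKeys() {
    long keysMax = sysconf(_SC_THREAD_KEYS_MAX);  // runtime PTHREAD_KEYS_MAX
    if (keysMax <= 0) keysMax = 1024;             // assume the common glibc value if unreported
    const long perIndex = 4;                      // IDMapper + vectors + metadata + filters
    const long global   = 1;                      // MetadataManager environment
    const long margin   = 16;                     // headroom for other consumers of TLS keys
    const long budget   = keysMax - global - margin;
    return budget > 0 ? static_cast<std::size_t>(budget / perIndex) : 0;
}
// With keysMax = 1024 this yields (1024 - 1 - 16) / 4 = 251, just under the
// current compile-time MAX_LIVE_INDICES = 255.
```
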
On glibc-based systems, `PTHREAD_KEYS_MAX` is a libc build-time constant, so increasing it would
require rebuilding glibc.

## How Eviction Works Today

`evictIfNeeded()` is currently a live-index-count guard. It runs when:

- `createIndex()` is about to create a new live index
- `getIndexEntry()` needs to load a cold index from disk

When eviction runs, it (see the sketch after this list):

1. picks the candidate at the back of `indices_list_`
2. saves the index first if it is dirty
3. marks the cache entry invalid
4. removes it from `indices_`

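Continuing the `IndexManager` sketch from the Overview section, those four steps might look
roughly like this; `saveIndex` and `markInvalid` remain hypothetical helpers, not the project's
actual function names:

```cpp
// Sketch of the four steps above, continuing the IndexManager sketch from the
// Overview section (indices_, indices_list_, and the hypothetical helpers are
// assumed to be declared there).
void IndexManager::evictIfNeeded() {
    if (indices_.size() < MAX_LIVE_INDICES) return;  // live-index-count guard only

    // 1. pick the candidate at the back of indices_list_
    const std::string victim = indices_list_.back();
    indices_list_.pop_back();

    auto it = indices_.find(victim);
    if (it == indices_.end()) return;

    // 2. save the index first if it is dirty
    if (it->second.dirty) saveIndex(it->second);

    // 3. mark the cache entry invalid
    markInvalid(victim);

    // 4. remove it from indices_
    indices_.erase(it);
}
```
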
One subtle but important detail: this is not yet a true inactivity-based or LRU policy.
`indices_list_` is updated on create/load, but not refreshed on search or mutation, and
`last_access` is currently not used anywhere in the eviction path.

In practice, that means eviction is closer to "oldest loaded/created live index first" than
"least recently used index first".

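For contrast, this is the kind of refresh step a true LRU would need on every search or mutation;
it is not what the current code does, and the free-function form and name are purely illustrative:

```cpp
#include <algorithm>
#include <list>
#include <string>

// Illustration only: move the touched index to the front of the eviction
// order so that the back of the list really is the least recently used entry.
void touchIndex(std::list<std::string>& indices_list, const std::string& name) {
    auto pos = std::find(indices_list.begin(), indices_list.end(), name);
    if (pos != indices_list.end()) {
        indices_list.splice(indices_list.begin(), indices_list, pos);  // O(1) move-to-front
    }
}
```
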
## TODO

1. There is no DRAM-based admission or eviction policy yet. `MAX_ANON_MEM` is only a commented
   placeholder. In practice, the usable memory ceiling should be computed at startup from the
   effective deployment limit: cgroup limits, container limits, and host/server memory limits.
   The way this limit is discovered will also be OS-specific, with different logic needed for
   Linux and macOS (a Linux-only sketch follows after this list).
2. `MAX_LIVE_INDICES` is a fixed compile-time constant. A better implementation would derive a
   safe cap from the actual runtime environment, for example by checking the system's
   `PTHREAD_KEYS_MAX` value via `getconf PTHREAD_KEYS_MAX`.
3. The server does not currently refuse startup when the machine or container cannot satisfy the
   minimum memory footprint needed to keep the required live indices healthy.

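A possible Linux-only starting point for item 1, taking the smaller of the cgroup v2 limit and
physical RAM; the function name is illustrative, cgroup v1 is not handled, and macOS would need a
different mechanism entirely:

```cpp
#include <unistd.h>
#include <cstdint>
#include <fstream>
#include <string>

// Effective memory ceiling on Linux: min(cgroup v2 limit, physical RAM).
// "/sys/fs/cgroup/memory.max" is the unified-hierarchy (cgroup v2) limit file;
// it contains either a byte count or the literal string "max".
std::uint64_t effectiveMemoryLimitBytes() {
    const std::uint64_t physical =
        static_cast<std::uint64_t>(sysconf(_SC_PHYS_PAGES)) *
        static_cast<std::uint64_t>(sysconf(_SC_PAGE_SIZE));

    std::ifstream cgroup("/sys/fs/cgroup/memory.max");
    std::string value;
    if (cgroup >> value && value != "max") {
        const std::uint64_t limit = std::stoull(value);
        if (limit < physical) return limit;
    }
    return physical;  // no cgroup limit (or not in a container): use host RAM
}
```
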
