Add Memento provider #43
Open
veerps57 wants to merge 1 commit into
Adds the `memento` memory provider that exercises Memento (https://github.com/veerps57/memento) — a local-first, MCP-native memory layer for AI assistants. The provider spawns `memento serve` as a stdio MCP subprocess and routes ingest, search, and clear through MCP tool calls.

Memento is designed to store **distilled assertions, not transcripts**. In production the calling AI assistant uses its own LLM to decide what's worth remembering, then hands those candidates to Memento's `extract_memory` MCP tool, which embeds, scrubs, dedups, and persists. To faithfully represent that flow inside the bench (which only hands the provider raw `UnifiedSession` transcripts), this provider performs the same distillation step itself — calling the configured LLM per session and passing the resulting candidates to `extract_memory`.

Per-question isolation uses Memento's `workspace` scope keyed by the benchmark's `containerTag` (Memento's `session.id` requires a 26-char ULID, while `containerTag` is an arbitrary string). One DB, one server, many scopes.

Provider config (env):

- `MEMENTO_BIN` — shell-like command for `memento serve` (default: `npx -y @psraghuveer/memento`)
- `MEMENTO_BENCH_DB` — SQLite path (default: `/tmp/memento-bench-<ts>.db`)
- `MEMENTO_DISTILL_MODEL` — LLM alias for distillation (defaults to memorybench's answering model)
- `MEMENTO_BENCH_SEARCH_LIMIT` — top-K returned by `search_memory` (default: 30)
- `MEMENTO_AWAIT_INDEXING_MS` — per-question polling deadline (default: 180000)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
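For orientation, a minimal sketch of how this configuration could be resolved. The env var names and defaults come from the list above; the helper name and the config shape are illustrative assumptions, not the provider's actual code:

```ts
// Illustrative sketch only: resolves the documented MEMENTO_* env vars into a
// typed config object using the documented defaults. The helper name and the
// exact shape are assumptions; only the env vars and defaults come from above.
interface MementoBenchConfig {
  bin: string;             // command used to spawn `memento serve`
  dbPath: string;          // SQLite database path
  distillModel?: string;   // undefined means: fall back to the answering model
  searchLimit: number;     // top-K passed to search_memory
  awaitIndexingMs: number; // per-question polling deadline
}

function resolveMementoConfig(env = process.env): MementoBenchConfig {
  return {
    bin: env.MEMENTO_BIN ?? "npx -y @psraghuveer/memento",
    dbPath: env.MEMENTO_BENCH_DB ?? `/tmp/memento-bench-${Date.now()}.db`,
    distillModel: env.MEMENTO_DISTILL_MODEL,
    searchLimit: Number(env.MEMENTO_BENCH_SEARCH_LIMIT ?? 30),
    awaitIndexingMs: Number(env.MEMENTO_AWAIT_INDEXING_MS ?? 180_000),
  };
}
```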
veerps57 added a commit to veerps57/memento that referenced this pull request on May 15, 2026:
## Problem

Two needs motivated this work; running against the bench surfaced two more engine gaps whose fixes ship in the same PR.

1. **Memento has no published end-to-end benchmark.** The MCP-registry launch built credibility on architectural commitments and a clean API; the next step toward trust is "here are the numbers, here is how to reproduce them." We need a reusable harness against the de-facto industry datasets (LoCoMo, LongMemEval) so a sceptical engineer can re-run a baseline on their laptop and verify.
2. **Memento's `extract_memory` contract had several first-try-wrong footguns for AI assistants doing distillation.** While building the benchmark provider, every distillation attempt failed silently because the candidate shape, the topic-line requirement, and the async-mode receipt semantics weren't surfaced in the MCP tool description (some were in the skill but invisible to a tool consumer reading only `tools/list`). The tool's own example was itself non-validating. Tag-regex failures returned a bare `(root): Invalid`. These are the kind of friction that turns "give Memento a try" into "give Memento up."
3. **FTS5 recall was stem-blind for prose queries.** The bench's first low-score question (LongMemEval `d24813b1`, "tips on what to bake for colleagues") missed the gold-truth memory because the memory said "colleague's going-away party" and the query said "colleagues" — the default `unicode61` FTS5 tokenizer treats them as different tokens. The `retrieval.fts.tokenizer` config key advertised `porter` as a tunable alternative, but no migration or runtime code ever read it: dead-code tunability. Vector embedding rescued some morphological misses, but not enough, and the failure mode is exactly the one a durable memory layer needs to handle well (the speaker's wording and the future question's wording rarely match in surface form).
4. **`embedder-local.embedBatch` was sequential under the hood.** The implementation looped `extractor(text, ...)` per row with a comment pointing at the transformers.js v2 limitation. transformers.js v3 (already pinned via `^3.0.0`) accepts an array input and runs one forward pass for the whole batch — verified row-by-row numerically identical to the single-call form.

## Change

Four coordinated workstreams ship together. The bench surfaced (3) and (4); the fixes that close them improve Memento for every assistant doing memory work, not just for the bench's score.

- **Bench driver — `scripts/bench.mjs` + `docs/guides/benchmark.md`.** A vanilla-Node ESM driver that builds Memento, stages a memorybench fork at a pinned ref (or a local checkout via `--memorybench-dir`), spawns one `bun run src/index.ts run -p memento -b <bench>` per requested benchmark, and writes a single summary markdown to `bench/<ts>.md` (the `bench/` directory is git-ignored). Defaults to LoCoMo + LongMemEval with `sonnet-4.6` pinned for judge + answering + distillation — the model class that actually shows up on the conversation side in real Memento usage (Claude Code, Cursor, and Claude Desktop are the MCP-using-client majority, and `extract_memory` distillation happens in *that* same assistant). Sonnet 4.6 supports `temperature=0` (deterministic at the model layer) and the alias is registered in the fork's `MODEL_CONFIGS`. Top-K=30, 180s indexing deadline.
Per-phase concurrency flags pass through to memorybench so a slow embedder or a throttled Anthropic endpoint can be tamed (`--concurrency-ingest=1` is the safe knob for `sonnet-4.6` under bursty rate-limit pressure, and it also lets the provider's per-session distillation cache hit when questions share sessions). The driver spawns the locally built CLI via `process.execPath` and asserts `better-sqlite3` loads under that exact Node before doing any expensive work, so an `nvm + homebrew` PATH cocktail can't crash the run with a confusing "MCP error -32000: Connection closed". A `--resume=<runId>` flag picks up at the failed phase of the failed question for a crashed run (memorybench's orchestrator checkpoints after every phase boundary); the runId is logged on a dedicated line in the bench log and reprinted as a copy-pasteable command on any non-zero exit. `--out` anchors to the Memento repo root regardless of `cwd`, so running from inside the fork checkout doesn't leak the output directory into the fork worktree. The provider implementation itself lives in a fork of `supermemoryai/memorybench` ([`veerps57/memorybench@add-memento-provider`](https://github.com/veerps57/memorybench/tree/add-memento-provider)).

- **`extract_memory` tool surface + distillation craft.** The MCP tool description on `extract_memory` states the candidate-shape difference from `write_memory` (flat `kind` enum, top-level `rationale`/`language`), the `topic: value\n\nprose` requirement for `preference`/`decision` kinds, the `storedConfidence: 0.8` async default, and the receipt-not-failure semantics of `mode: "async"`. The inline example exercises four kinds with the correct field placement — including a `preference` candidate that opens with the required topic line and a `decision` candidate with top-level `rationale` (see the sketch after this list). `TagSchema` carries a custom error message listing the allowed charset, so `April 15, 2026` produces an actionable diagnostic instead of a bare "Invalid". The skill (`skills/memento/SKILL.md`), the persona-snippet guide (`docs/guides/teach-your-assistant.md`), and the landing-page persona-snippet mirror (`packages/landing/src/App.tsx`) carry a "Distillation craft" section that frames the task as **retrieval indexing for unknown future queries** (not summarisation for a reader) and codifies six rules in priority order:
  1. preserve specific terms — proper nouns, identity qualifiers, named entities, places, and the specific object of every action;
  2. capture facts about every named participant, not only the user — a friend the user mentions ("my friend Alex is moving to Berlin for a SAP job") or a co-speaker in a multi-party transcript both deserve candidates attributed to the right named person, not collapsed onto the user;
  3. emit a candidate for every dated event with the date resolved against the session anchor, never collapsing it to an untimed habit;
  4. capture precursor actions alongside outcomes — "researched X then chose Y" emits two candidates, since future questions can target either step;
  5. don't squash enumerations into category labels;
  6. bias toward inclusion — the server dedups via embedding similarity, so over-including is cheap and under-including is permanent.

  A pre-emit self-check ("did every date, named entity, and verb-with-specific-object map to a candidate?") sits alongside the rules in each surface.
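As a reference for the sketch promised above: a candidate batch shaped by the constraints just listed. The flat `kind`, top-level `rationale`/`language`, the topic-line requirement, and the async receipt semantics come from the description; every other field name and value is an assumption for illustration:

```ts
// Sketch shaped by the constraints above; field names beyond
// kind/content/rationale/language/mode are assumptions.
declare const client: {
  callTool(req: { name: string; arguments: unknown }): Promise<unknown>;
};

const candidates = [
  {
    kind: "preference", // flat kind enum (not write_memory's discriminated union)
    // preference/decision content must open with a `topic: value` line,
    // then a blank line, then prose:
    content: "office baking: lemon-poppyseed\n\nLikes to bake lemon-poppyseed treats for colleagues.",
    language: "en", // top-level, not nested under metadata
  },
  {
    kind: "decision",
    content: "farewell party: bake\n\nDecided to bake for a colleague's going-away party.",
    rationale: "Preferred a homemade gesture over buying a gift", // top-level rationale
  },
];

// mode: "async" returns a receipt (candidates land with storedConfidence 0.8
// by default); a receipt is not a per-candidate failure signal.
await client.callTool({
  name: "extract_memory",
  arguments: { candidates, mode: "async" },
});
```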
- **Porter stemming for FTS5 — migration 0008 + honoured config + default flip.** `memories_fts` is now built with `tokenize='porter unicode61'` instead of the default `unicode61`. The chain runs right-to-left: unicode61 splits + diacritic-folds first (so non-ASCII content still tokenises correctly — German umlauts, French diacritics, Japanese katakana all survive intact, covered by tests), then porter stems the resulting ASCII tokens. "colleague", "colleagues", and "colleague's" share a stem and match each other; "bake" matches "baking" / "baked" / "bakes"; "research" matches "researched" / "researches"; "agency" matches "agencies". The `retrieval.fts.tokenizer` config key now defaults to `porter` and is documented as honoured by the FTS index (it was previously declared but ignored by the migration — dead-code tunability that this change makes real). Migration 0008 drops and rebuilds `memories_fts` with the new tokenizer, preserving stable rowids via the `memories_fts_map` table; the runner applies it on first server start after upgrade, so no operator action is required. Six new unit tests cover stem-variant matching (plural/singular, verb-form pairs), pre-migration re-indexing, the insert/update/delete triggers carrying the new tokenizer through write-path operations, and non-ASCII preservation. The trade-off accepted is porter's known over-stems (organize/organic, universe/university); for Memento's dominant query distribution — assistants asking about durable user state in natural language — recall on stem variants is worth more than precision on these edge cases. Operators who need the older behaviour can author a follow-up migration; the config key documents the option.

- **Embedder perf — real batched feature-extraction in `@psraghuveer/memento-embedder-local`.** `embedBatch` now uses transformers.js v3's array-input pipeline, which runs one forward pass for the whole batch instead of looping per text. Numerically identical to the single-call form (verified row-by-row against the same input). Measured ~1.8× speedup on a 3-input batch with `bge-base-en-v1.5` on CPU; the speedup grows with batch size because tokenisation and pipeline setup amortise across the batch. The loader contract now returns `{ embed, embedBatch? }` instead of a bare `embed` function; loaders that omit `embedBatch` fall back to the previous sequential behaviour, so test fixtures and bespoke implementations keep working unchanged. Seven new unit tests cover the fast path, the sequential fallback, empty-input short-circuit, runtime row-count mismatch, per-row dimension validation, batched `maxInputBytes` truncation, and whole-batch timeout. The `EmbeddingProvider.embedBatch` surface in `@psraghuveer/memento-core` is unchanged and remains optional; existing call sites that go through `embedBatchFallback` (`pack.install`, `import`, `embedding.rebuild`, the synchronous extract slow-path) automatically pick up the fast path. A sketch of the batched path follows this list.
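The sketch promised above: the batched path with its guards, assuming transformers.js v3's array-input pipeline. The package import and model id are the commonly published ones, not confirmed from this repo, and the guard messages are illustrative:

```ts
import { pipeline } from "@huggingface/transformers"; // transformers.js v3

// Sketch: one forward pass for the whole batch. v3 accepts an array input and
// tokenises + runs the batch together; v2 required looping extractor(text).
// Model id is illustrative.
const extractor = await pipeline("feature-extraction", "Xenova/bge-base-en-v1.5");

async function embedBatch(texts: string[]): Promise<number[][]> {
  if (texts.length === 0) return []; // empty-input short-circuit

  const out = await extractor(texts, { pooling: "mean", normalize: true });
  const rows = out.tolist() as number[][]; // shape [batch, dim]

  // Guard the contract the wrapper validates: row count matches input count.
  if (rows.length !== texts.length) {
    throw new Error(`embedBatch: expected ${texts.length} rows, got ${rows.length}`);
  }
  return rows;
}
```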
## Justification against the four principles

- **First principles.** The benchmark driver introduces no new behavioural constants in Memento itself — every knob is a CLI flag or env var declared in `DEFAULTS` at the top of `bench.mjs`. The tool-description changes surface constraints that already existed in code (schema validation, the conflict-detector's topic-line parsing) where an assistant reading `tools/list` will see them. The FTS-tokenizer change does flip a default (`retrieval.fts.tokenizer`: `unicode61` → `porter`), but the migration that effects it is the canonical Memento way of evolving stored state — and the config key that controls it has shipped since the registry release and is now actually honoured. The embedder change is purely a perf path; semantics are byte-identical to the previous behaviour.
- **Modular.** `scripts/bench.mjs` is a thin driver — the provider lives in a fork of memorybench, the harness is memorybench's, and the judge/answering models are memorybench's. The Memento side is one script + one guide + a pinned fork ref. The distillation-clarity changes touch documentation and a single Zod error message; no behavioural code paths are added. The FTS change is a single migration file + one config-key default flip. The embedder change extends the existing `EmbedRuntime` shape with an optional `embedBatch` and adds a single fast-path branch in the wrapper; the rest is sequential-fallback compatibility.
- **Extensible.** Adding a third benchmark (ConvoMem) is a one-line change to `DEFAULTS.benchmarks`. Adding a different judge family means pointing `--judge` at a different model alias; the script's API-key check fans out by family. The skill's distillation-craft section is positioned so a future contributor can extend the rules without restructuring. The FTS migration's pattern (drop → rebuild → repopulate → retrigger) is the same as 0005's — future tokenizer changes follow the same template. The loader contract's optional `embedBatch` lets bespoke embedders opt into batching when their runtime supports it, without forcing a contract upgrade on the others.
- **Config-driven.** Every benchmark default (model, ref, limit, concurrency, search-K, indexing deadline) is overridable from the command line or env. The FTS-tokenizer choice is `retrieval.fts.tokenizer` — operators can stay on `unicode61` by setting it before first server start and recreating the FTS table via a follow-up migration. The embedder change adds no new config key (the runtime contract change is internal); operators with custom loaders are unaffected by default.

## Alternatives considered

- **Vendor memorybench inside the Memento repo.** Rejected: keeping the harness external means we don't own its release cadence, and the provider lands as a normal contribution upstream. The driver pulls a pinned fork ref, so reproduction is exact.
- **Add LLM-driven distillation inside `extract_memory` itself.** Rejected: Memento's architectural commitment is local-first and LLM-agnostic. Baking in an LLM would either pull in a cloud provider (breaking local-first) or ship a bundled local model (breaking LLM-agnostic and adding ops complexity). Distillation belongs to the calling AI assistant, where the conversation context lives. The bench provider does its own distill step to mirror that flow.
- **Re-design the candidate shape so `write_memory` and `extract_memory` accept the same payload.** Considered, rejected: the discriminated-union shape on `write_memory` is the right design for a single-row call where kind-specific metadata is the point; the flat shape on `extract_memory` is the right design for a batch where the per-item type is data, not a routing tag. Documenting the difference is correct; collapsing them would weaken both APIs.
- **Keep `unicode61` as the FTS default and ship porter as an opt-in only.** Rejected: the `retrieval.fts.tokenizer` config key was already documented as the operator-tunable knob, and validation of the porter path on a real bench question showed unicode61 missing the gold-truth memory at the FTS layer entirely.
The migration is the right place to flip the default because anyone who actively wants unicode61 can author a follow-up migration; the silent majority who never touched the key get a measurable recall improvement.

- **Heavier embedder optimisations — quantisation (`dtype: 'q8'`), worker thread, WebGPU.** Deferred: quantisation is a recall trade-off that needs its own evaluation pass; worker threads improve event-loop responsiveness without raising throughput on a single CPU; WebGPU only helps browser hosts (Memento runs on Node). Real batched feature-extraction is the largest no-trade-off win available today, so it's the one shipped here.

## Tests

- [x] Unit — full unit suite passes on this branch, plus 13 new tests (six for migration 0008 covering stem variants, pre-migration re-indexing, triggers, and non-ASCII preservation; seven for the embedder fast-path and sequential-fallback paths).
- [ ] Integration — N/A; no new integration paths added beyond the existing extract path, which is already integration-tested.
- [x] Migration — `0008_fts_porter_tokenizer` is forward-only, idempotent on a fresh DB, and verified end-to-end against a pre-0008 install via `MIGRATIONS.slice(0, 7)` in the test suite.
- [x] End-to-end — the existing `serve` e2e passes. The bench itself is the new end-to-end exercise but is not part of `pnpm verify` for the reasons documented in `docs/guides/benchmark.md` (it needs network, judge API keys, and hours of wall-clock — CI must pass offline). A focused 1Q LongMemEval validation against the baking question confirmed the porter fix lifts that question from 0 → 1 correct, with the lemon-poppyseed memory ranking #4 in retrieval where previously it didn't reach the top 30.
- [ ] N/A — see above.

## Local verification

- [x] `pnpm verify` (<!-- verify-chain:begin -->lint → typecheck → build → test → test:e2e → docs:lint → docs:reflow:check → docs:links → docs:check → format:packs:check → server-json:check<!-- verify-chain:end -->) — all green at branch HEAD.
- [x] `pnpm docs:generate` — run; `docs/reference/{cli,mcp-tools,config-keys}.md`, `AGENTS.md`, `CONTRIBUTING.md`, `.github/copilot-instructions.md`, and `.github/PULL_REQUEST_TEMPLATE.md` regenerated to pick up the new `extract_memory` description, the `TagSchema` error message, and the `retrieval.fts.tokenizer` default + description.

## ADR

- [ ] An ADR is required and is included in this PR.
- [ ] An ADR is required and exists already (link below).
- [x] No ADR required (explain why): the bench driver, the tool-description changes, the embedder fast path, and the FTS tokenizer migration are all within the ADR exemption list in `AGENTS.md`:
  - The bench driver is optional tooling — it adds a script and a guide; it doesn't change the public surface, the data model, scope semantics, or any top-level dependency.
  - The `extract_memory` tool-description and `TagSchema` error-message changes make existing contracts more discoverable without changing them.
  - The embedder fast path is a perf optimisation with byte-identical output; no semantic change.
  - The FTS tokenizer change is a forward-only migration that honours an already-documented config key (`retrieval.fts.tokenizer`). The default flip is operator-visible, but it neither introduces a new behavioural constant nor changes a contract — it activates a knob that already shipped. Memento's stance on tokenizer choice was always "operator-configurable, default may evolve as the use case sharpens" (per the config key's description).
## AI involvement

- [ ] No AI assistance.
- [ ] AI assistance for boilerplate / drafting only.
- [x] AI authored substantial portions. I have verified every line. The bench driver, the provider in the memorybench fork, the audit of `extract_memory`'s distillation-friction surface, the porter migration + tests, the embedder batching + tests, and the prose updates to the skill / persona guide / landing snippet were drafted with Claude. Every change was reviewed and exercised end-to-end against LoCoMo and LongMemEval smokes through the full pipeline (distill → write → indexing → search → answer → judge). The Zod error-message change and the tool-description text were verified against the actual code paths they describe. The porter fix specifically was validated by re-running the same failed bench question against the new code and confirming the gold-truth memory now ranks at the top of the retrieved set with the same models, same haystack, same scope.

## Linked issues

Corresponding memorybench PR: supermemoryai/memorybench#43

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Adds Memento — a local-first, MCP-native memory layer published on the MCP registry — as a fifth `Provider` option in memorybench, alongside Supermemory, Mem0, Zep, and the filesystem baseline. Memento runs as a stdio subprocess against a single SQLite database; this integration exercises it through that same MCP transport, so the bench measures the realistic shape (process spawn, JSON-RPC over stdio, async write + auto-embed, hybrid FTS+vector retrieval).

No API key required for Memento itself. The provider performs a per-session distillation step via an LLM call before handing candidates to Memento's `extract_memory` tool; the model is configured via the `MEMENTO_DISTILL_MODEL` env var and falls back to memorybench's `DEFAULT_ANSWERING_MODEL` constant (gpt-4o) when unset. So an OpenAI key is needed for the default distill model; set `MEMENTO_DISTILL_MODEL` to use a different family.
## Motivation

Memento was published to the MCP registry yesterday — a local-first, MCP-native memory layer for AI assistants that need durable, structured memory. Publishing a memory project on the registry without numbers next to the established ones felt incomplete; this PR proposes the integration so Memento can be measured against Supermemory, Mem0, Zep, and the filesystem baseline on the same harness, datasets, and judges. The honest way to answer "how does it compare?" is to run it through your bench and let the numbers speak.
We're not asking for any reorientation of the bench. Memento slots into the existing `Provider` interface unchanged. The harness, datasets, judges, and other providers are not touched.

## How the provider works
`MementoProvider` implements the five-method `Provider` interface (a skeletal sketch follows the list):

- `initialize` — spawns `memento serve --db <tmp>` via `@modelcontextprotocol/sdk`'s `StdioClientTransport`, asserts the required MCP tools (`extract_memory`, `search_memory`, `forget_many_memories`) are present on `tools/list`, runs one warmup write+forget pair so the embedder model is loaded before the first benchmark question, and calls `.unref()` on the spawned child process and its stdio pipes so the Node event loop doesn't keep waiting on the subprocess after the bench's main work completes (without this, the bench hangs at exit on `Run complete!`).
- `ingest` — for each session in the `UnifiedSession`, calls an LLM to distill the transcript into structured `{kind, content, summary?}` candidates, then hands the batch to Memento's `extract_memory` MCP tool. Memento embeds, scrubs, dedups, and persists. The distillation model is read from `MEMENTO_DISTILL_MODEL` and falls back to memorybench's `DEFAULT_ANSWERING_MODEL` constant when unset. Memories land under `scope = {type: 'workspace', path: '/memorybench/<containerTag>'}` (Memento's `session` scope requires a ULID, while memorybench's `containerTag` is an arbitrary string — workspace scope is the right isolation primitive). Each candidate carries `benchmark:memorybench`, `session:<id>`, and (when present) `session-date:<iso>` tags. A per-run distillation cache keyed by `session.sessionId` deduplicates LLM calls when questions in the same conversation share sessions; the cached output is whatever the first distill produced for that session.
- `awaitIndexing` — polls `search_memory` on the question's scope until every result has `embeddingStatus !== 'pending'` (or `MEMENTO_AWAIT_INDEXING_MS`, default 180s, elapses). Memento's auto-embed runs fire-and-forget after each write; this is the bridge.
- `search` — runs `search_memory` with the question's scope filter, `projection: 'full'`, and `limit: this.searchLimit` (default 30, overridable via `MEMENTO_BENCH_SEARCH_LIMIT`). The 30 default is the same number Supermemory and Mem0 use — every distill-style provider that I read in this repo overrides the orchestrator's `limit: 10` to give the answering model a richer haystack.
- `clear` — calls `forget_many_memories` with the per-question scope as the filter. The orchestrator doesn't call this in normal runs, but it's there for partial-rerun recovery.
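The promised sketch of that lifecycle against the MCP SDK's client API. Warmup, `.unref()`, tagging, the distillation cache, and error handling are omitted; `distillSession` stands in for `distill.ts`, and the `awaitIndexing` polling loop is sketched later under "Design choices":

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Skeletal sketch of the provider lifecycle described above.
declare function distillSession(session: { sessionId: string }): Promise<unknown[]>;

class MementoProviderSketch {
  private client = new Client({ name: "memorybench-memento", version: "0.0.1" });

  async initialize(dbPath: string) {
    const transport = new StdioClientTransport({
      command: "npx",
      args: ["-y", "@psraghuveer/memento", "serve", "--db", dbPath],
    });
    await this.client.connect(transport);
    // Assert the required tools are actually advertised on tools/list.
    const { tools } = await this.client.listTools();
    for (const required of ["extract_memory", "search_memory", "forget_many_memories"]) {
      if (!tools.some((t) => t.name === required)) throw new Error(`missing tool: ${required}`);
    }
  }

  // workspace scope keyed by containerTag (session scope would need a ULID)
  private scope(containerTag: string) {
    return { type: "workspace", path: `/memorybench/${containerTag}` };
  }

  async ingest(session: { sessionId: string }, containerTag: string) {
    const candidates = await distillSession(session); // LLM distillation, cached per session
    await this.client.callTool({
      name: "extract_memory",
      arguments: { candidates, scope: this.scope(containerTag) },
    });
  }

  async search(query: string, containerTag: string) {
    return this.client.callTool({
      name: "search_memory",
      arguments: { query, scope: this.scope(containerTag), projection: "full", limit: 30 },
    });
  }

  async clear(containerTag: string) {
    await this.client.callTool({
      name: "forget_many_memories",
      arguments: { scope: this.scope(containerTag) },
    });
  }
}
```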
The provider supplies a custom `answerPrompt` (`src/providers/memento/prompts.ts`) that presents each retrieved memory with its score, kind, and session date — the latter being the temporal anchor Memento captures during distillation. The format follows the same shape as `filesystem/prompts.ts` and `supermemory/prompts.ts`: structured retrieved-context block + numbered "How to Answer" steps + "I don't know" refusal clause + Reasoning/Answer output template.
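A rough sketch of the shape that description implies; the exact wording, field names, and layout of the real prompt are not from this PR:

```ts
// Illustrative shape only; the real prompt lives in src/providers/memento/prompts.ts.
function answerPromptSketch(
  question: string,
  memories: Array<{ content: string; kind: string; score: number; sessionDate?: string }>,
): string {
  const context = memories
    .map((m, i) =>
      `[${i + 1}] (${m.kind}, score ${m.score.toFixed(2)}` +
      `${m.sessionDate ? `, session ${m.sessionDate}` : ""}) ${m.content}`)
    .join("\n");
  return [
    "Retrieved memories:",
    context,
    "",
    "How to Answer:",
    "1. Use only the memories above; treat session dates as temporal anchors.",
    '2. If the memories do not contain the answer, say "I don\'t know".',
    "",
    "Reasoning: <your reasoning>",
    "Answer: <your answer>",
    "",
    `Question: ${question}`,
  ].join("\n");
}
```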
## What's in the PR

New files (all under `src/providers/memento/`):

- `index.ts` — provider class (`MementoProvider`), session ingest with the distillation cache, per-question scope mapping, MCP client lifecycle. ~420 lines.
- `distill.ts` — the per-session LLM distillation step. Reads `MEMENTO_DISTILL_MODEL` (falling back to `DEFAULT_ANSWERING_MODEL`), builds a transcript prompt, parses the JSON response into typed candidates. The prompt codifies six craft rules (preserve specific terms; capture facts about every named participant; emit a candidate for every dated event; capture precursor actions alongside outcomes; don't squash enumerations; bias toward inclusion). ~235 lines. A sketch of this step follows the list.
- `prompts.ts` — custom `ProviderPrompts` with the Memento `answerPrompt`. ~70 lines.
- `mcp-helpers.ts` — `parseSearchPage` and `parseToolResultJson` helpers for the MCP-result envelope. ~85 lines.
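The sketch promised above for the distillation step. The rules are quoted from the description; `callModel` is a hypothetical stand-in for memorybench's model-calling utility, and the prompt text is compressed:

```ts
// Hypothetical sketch of distill.ts's core loop. callModel() stands in for
// memorybench's LLM utility; the candidate type mirrors the shape described above.
interface Candidate { kind: string; content: string; summary?: string }

declare function callModel(model: string, prompt: string): Promise<string>;

const DISTILL_RULES = `Distill this transcript into memory candidates.
Rules, in priority order:
1. Preserve specific terms (proper nouns, named entities, the object of every action).
2. Capture facts about every named participant, attributed to the right person.
3. Emit a candidate for every dated event, resolved against the session anchor.
4. Capture precursor actions alongside outcomes.
5. Don't squash enumerations into category labels.
6. Bias toward inclusion; the server dedups.
Return a JSON array of {kind, content, summary?}.`;

export async function distillSession(transcript: string, model: string): Promise<Candidate[]> {
  const raw = await callModel(model, `${DISTILL_RULES}\n\nTranscript:\n${transcript}`);
  // The real implementation validates each row; this sketch just parses.
  return JSON.parse(raw) as Candidate[];
}
```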
Edits to existing files:

- `src/providers/index.ts` — registers `memento: MementoProvider` in the providers map.
- `src/types/provider.ts` — adds `"memento"` to the `ProviderName` union.
- `src/utils/config.ts` — `getProviderConfig('memento')` returns minimal config (Memento has no API key; configuration is via the `MEMENTO_*` env vars documented in the README).
- `src/cli/index.ts` — extends the `help providers` printer.
- `src/utils/models.ts` — adds two new Anthropic aliases (`sonnet-4.6` → `claude-sonnet-4-6`, `opus-4.6` → `claude-opus-4-6`). These are catalog additions, useful to all providers and not Memento-specific — happy to split them into a separate PR if you'd prefer; they were a dependency of Memento's default answering-model choice during local validation.
- `README.md` — adds `memento` to the `-p` flag's provider list and documents the `MEMENTO_*` env vars alongside the existing provider config block.
- `package.json` — adds `@modelcontextprotocol/sdk: ^1.29.0` (the canonical MCP client library; the pin matches what Memento itself ships).

## Design choices worth flagging
A few choices that could look like benchmark tuning if you don't read the implementation — happy to revisit any of them:
- **Search limit 30.** The orchestrator defaults to `limit: 10`, and every distill-style provider in this repo (`supermemory`, `mem0`'s fallback) overrides it the same way. The 10 default is used for Hit@K computation; the answering model still benefits from a richer haystack. We follow the cluster norm.
- **Per-run distillation cache keyed by `session.sessionId`.** Cache hits return exactly what the first distill produced for that session — LLM output isn't strictly deterministic even at `temperature=0`, so the cache also has the side effect of making within-run distillation reproducible across questions that share sessions.
- **Indexing deadline.** `MEMENTO_AWAIT_INDEXING_MS` defaults to 180s. Memento's auto-embed is fire-and-forget per write; the bench's first read proceeds once every result is embedded or this deadline elapses, whichever comes first (sketched below). For runs where many questions × many memories per scope outpace the local CPU embedder (notably the larger LongMemEval questions, which have ~50 sessions × ~25 candidates each), operators can bump this knob so the embedding queue drains before search. Documented in the README.
- **Workspace scope, not session scope.** Memento's `session` scope requires a ULID; memorybench passes arbitrary `containerTag` strings. Using `{type: 'workspace', path: '/memorybench/<containerTag>'}` gives us a stable, immutable scope per question that respects Memento's scope-is-immutable rule. The reverse-mapping is the natural fit; happy to discuss alternatives.
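The sketch referenced in the indexing-deadline item: one plausible shape for the `awaitIndexing` bridge. It assumes each `search_memory` row exposes `embeddingStatus` (per the provider description); the wildcard query and the 1s poll interval are assumptions, and `parseSearchPage` stands in for the `mcp-helpers.ts` helper:

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Sketch of the awaitIndexing polling bridge described above.
declare function parseSearchPage(result: unknown): Array<{ embeddingStatus: string }>;

async function awaitIndexing(
  client: Client,
  scope: { type: "workspace"; path: string },
  deadlineMs = Number(process.env.MEMENTO_AWAIT_INDEXING_MS ?? 180_000),
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < deadlineMs) {
    const result = await client.callTool({
      name: "search_memory",
      arguments: { query: "*", scope, projection: "full", limit: 30 },
    });
    // Done once nothing in the scope is still waiting on auto-embed.
    if (parseSearchPage(result).every((r) => r.embeddingStatus !== "pending")) return;
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  // Deadline elapsed: proceed with whatever has finished embedding.
}
```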
## Testing

Manually tested end-to-end against LoCoMo and LongMemEval (limit and random-sample modes), including a `--resume` flow after an Anthropic `Overloaded` 529 mid-run. No new unit tests added — the provider package mirrors the structure of the other four providers, none of which carry their own test files in this repo. Happy to write some if you'd prefer a different bar than the existing precedent.

Verified runs were on:
- `bun --version` 1.3.13
- `node --version` v22.19.0 (the workspace Node)
- `sonnet-4.6` across answering / distillation / judge; `opus-4.6` separately exercised as the distillation model with `sonnet-4.6` for answering + judge

## Backwards compatibility
Strictly additive. The new provider is opt-in via `-p memento`. The orchestrator, the dataset loaders, the judges, and the four existing providers are not touched. The new dependency (`@modelcontextprotocol/sdk`) is a small library that adds ~1 MB to `node_modules`; it loads with the providers registry like any other static dep.

The two new model aliases are also additive — they don't change resolution for any existing alias.
## Future work (out of scope for this PR)

- Unit tests with a mocked transport from `@modelcontextprotocol/sdk`, so the provider can be tested without spawning a real `memento serve`. Left out here to match the current precedent (no per-provider unit tests in `src/providers/`).
- ConvoMem — the provider consumes the same `UnifiedSession`, so it should work without changes, but I haven't run it.

## License + provenance
Memento is Apache-2.0 (same as memorybench). The provider code in this PR is original and is licensed under the same terms.