MCP spec should address tool schema token overhead (~1000 tokens/tool consumed per session) #2812

luw2007 · 2026-05-28T13:39:33Z

Date: 2026-05-28
Reporter: @luwei-will
Project: context-mode (Claude Code plugin)
Scope: 11 MCP tools — token footprint per tool definition

1. Executive Summary

MCP tool definitions consume 5–15× more tokens than the simplest possible schema for the same tool (type-only, no descriptions). In a typical Claude Code session with 20–30 registered MCP tools, the tool schema alone occupies 15–30 KB of context window before a single user message is sent. This is a structural inefficiency in the MCP protocol: context window capacity is consumed even when billing is amortized via prompt caching.

This issue documents measured token overhead of 11 production MCP tools, identifies the root cause, and proposes three protocol-level mitigations.

2. Measured Data

| Tool | Tokens | Relative to `ctx_stats` |
|---|---|---|
| `ctx_batch_execute` | 1,024 | 9.9× |
| `ctx_execute` | 1,024 | 9.9× |
| `ctx_fetch_and_index` | 972 | 9.4× |
| `ctx_index` | 858 | 8.3× |
| `ctx_execute_file` | 822 | 8.0× |
| `ctx_search` | 785 | 7.6× |
| `ctx_purge` | 646 | 6.3× |
| `ctx_insight` | 339 | 3.3× |
| `ctx_upgrade` | 127 | 1.2× |
| `ctx_doctor` | 107 | 1.0× |
| `ctx_stats` | 103 | 1.0× (baseline) |

Methodology: Token counts measured via Anthropic's token counting API (/v1/messages/count_tokens) with each tool definition serialized as its full JSON Schema including descriptions. Counts are the input tokens each tool definition adds to a request.

Key observation: Heavy tools (ctx_batch_execute, ctx_execute, ctx_fetch_and_index) each cost ~1,000 tokens to define. Light tools (ctx_stats, ctx_doctor) cost ~100 tokens. The 10× delta is entirely in JSON Schema size — parameter descriptions, type definitions, and nested object structures.

3. Root Cause Analysis

3.1 Schema Bloat

MCP requires every tool to expose a full JSON Schema. For ctx_batch_execute, the schema includes:

commands: array of objects, each with label (string) and command (string)
queries: array of strings
concurrency: integer with range constraints
timeout: integer

The field descriptions (required by MCP for LLM reasoning) add ~400 tokens. The type definitions add ~300. The nested object structure for commands adds ~300 more. Total: ~1,000 tokens for one tool that a human could describe in two sentences.
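To make the descriptions-versus-structure split concrete, here is a minimal sketch that compares a schema with and without its description annotations. The schema below is a hypothetical stand-in built from the field list above (every description string and the range bounds are placeholders), and character count of compact JSON is only a rough proxy for tokens, so real numbers should come from count_tokens.

```python
import json

# Hypothetical stand-in for a ctx_batch_execute-style schema. Field names come
# from the list above; description strings and range bounds are placeholders.
SCHEMA = {
    "type": "object",
    "properties": {
        "commands": {
            "type": "array",
            "description": "Commands to run; each has a label and a shell command.",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string", "description": "Section title for the output."},
                    "command": {"type": "string", "description": "Shell command to execute."},
                },
            },
        },
        "queries": {"type": "array", "items": {"type": "string"},
                    "description": "Search queries to run against the indexed output."},
        "concurrency": {"type": "integer", "minimum": 1, "maximum": 8,
                        "description": "Maximum number of commands run in parallel."},
        "timeout": {"type": "integer", "description": "Timeout in milliseconds."},
    },
}


def strip_descriptions(node, in_properties=False):
    """Copy a JSON Schema node, dropping "description" annotations.

    Keys directly inside a "properties" map are parameter names, so a
    parameter that happens to be called "description" is kept.
    """
    if isinstance(node, dict):
        return {
            key: strip_descriptions(value, key == "properties" and not in_properties)
            for key, value in node.items()
            if in_properties or key != "description"
        }
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    return node


def compact_size(obj):
    """Character count of the compact JSON encoding (a rough proxy, not a token count)."""
    return len(json.dumps(obj, separators=(",", ":")))


full = compact_size(SCHEMA)
bare = compact_size(strip_descriptions(SCHEMA))
print(f"with descriptions: {full} chars")
print(f"types only:        {bare} chars")
print(f"description share: {100 * (full - bare) / full:.0f}%")
```

Running both variants through count_tokens instead of `compact_size` would give the actual token split for a real tool.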

3.2 First-Turn Tax and Cache Fragility

Tool schemas are injected into the system prompt prefix. While prompt caching avoids re-billing cached prefixes on subsequent turns, the overhead is real in these scenarios:

First turn of every conversation — full cost paid, no cache
Cache eviction — any tool added, removed, or updated invalidates the entire prefix, forcing a cold re-injection
Context window capacity — cached or not, 10,000 tokens of schemas occupy attention slots, reducing the model's effective working memory for reasoning
Short conversations — cache amortization fails when sessions are fewer than ~5 turns, which is common for quick lookups

With 20 tools averaging 500 tokens each, 10,000 tokens of context window are consumed by schemas regardless of billing amortization.
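The cache-fragility point can be illustrated with a toy model. This sketch assumes a single cache entry keyed on the exact serialized tool block, which is a simplification and not a description of how any particular provider implements prompt caching. The descriptions are placeholders.

```python
import hashlib
import json


def prefix_key(tools):
    """Hash the serialized tool block; stands in for a cache keyed on the exact prefix."""
    blob = json.dumps(tools, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Tool names are from the table above; descriptions are placeholders.
tools = [
    {"name": "ctx_stats", "description": "Show session statistics."},
    {"name": "ctx_doctor", "description": "Run diagnostics."},
]
baseline = prefix_key(tools)

edited = [dict(tools[0], description="Show session statistics (verbose)."), tools[1]]
added = tools + [{"name": "ctx_upgrade", "description": "Upgrade the plugin."}]

variants = [
    ("unchanged", list(tools)),
    ("one description edited", edited),
    ("one tool added", added),
]
for label, variant in variants:
    status = "hit" if prefix_key(variant) == baseline else "miss (full re-injection)"
    print(f"{label}: cache {status}")
```

Under this model, any change to any tool produces a different key, so the whole block is paid for again.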

4. Impact on Real Workloads

From production usage (context-mode v1.0.151, 22 days, 2,600 conversations):

Per-conversation schema cost (first turn, no cache): ~10,000 tokens × $15/1M (Opus input) = $0.15/conversation
Across 2,600 conversations: ~$390 in first-turn schema cost
With 75% prompt cache hit rate: effective cost ~$0.04/conversation, or ~$100 across the 2,600 conversations
Context window opportunity cost: 10K tokens is approximately 5 pages of reasoning capacity permanently unavailable to the model per session
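The billing figures above can be reproduced with a few lines of arithmetic. This sketch assumes that cache hits are treated as free, which is the simplification that yields the ~$0.04 figure; actual cached-read pricing would shift the result somewhat.

```python
SCHEMA_TOKENS = 10_000        # schema tokens per conversation (20 tools x ~500)
PRICE_PER_MTOK = 15.00        # USD per 1M input tokens, as quoted above
CONVERSATIONS = 2_600
CACHE_HIT_RATE = 0.75         # assumption from the post; hits counted as zero cost

first_turn_cost = SCHEMA_TOKENS * PRICE_PER_MTOK / 1_000_000
uncached_total = first_turn_cost * CONVERSATIONS
effective_cost = first_turn_cost * (1 - CACHE_HIT_RATE)
effective_total = effective_cost * CONVERSATIONS

print(f"per conversation, no cache: ${first_turn_cost:.2f}")
print(f"all conversations, no cache: ${uncached_total:.0f}")
print(f"per conversation, 75% hits: ${effective_cost:.4f}")
print(f"all conversations, 75% hits: ${effective_total:.1f}")
```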

The billing cost is moderate. The reasoning capacity cost is the primary concern: a model reasoning over a complex task with 20 MCP tools registered has 10,000 fewer tokens available for actual work.

5. Proposed Mitigations

5.1 Tiered Schema Detail (Short-Term)

Allow MCP servers to register tools at two detail levels:

Discovery tier: Name + one-line description + parameter names/types only (no field docs)
Invocation tier: Full schema with descriptions, injected on-demand when the model selects the tool

The model sees the discovery tier in the system prompt. When it decides to call a tool, the runtime injects the invocation-tier schema for that specific tool into the next turn.

{
  "name": "ctx_batch_execute",
  "description": "Run commands in parallel, auto-index output, return matching sections.",
  "discovery_schema": {
    "type": "object",
    "properties": {
      "commands": {"type": "array"},
      "queries": {"type": "array"},
      "concurrency": {"type": "integer"},
      "timeout": {"type": "integer"}
    }
  },
  "invocation_schema": "<full schema with descriptions>"
}

Estimated saving: 60–70% of per-session schema tokens.
Tradeoff: Slightly reduced tool selection accuracy for rarely-used tools with non-obvious names.
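As a sketch of how a server might derive the discovery tier mechanically from the full schema: the `discovery_schema` and `invocation_schema` field names are the ones proposed above and are not part of the current MCP specification, and the helper names and placeholder descriptions are hypothetical.

```python
import json


def discovery_schema(full_schema):
    """Keep top-level parameter names and types only; drop descriptions and nesting."""
    properties = {}
    for name, spec in full_schema.get("properties", {}).items():
        properties[name] = {"type": spec["type"]} if "type" in spec else {}
    return {"type": "object", "properties": properties}


def tiered_tool(name, summary, full_schema):
    """Build a registration in the two-tier shape proposed above."""
    return {
        "name": name,
        "description": summary,
        "discovery_schema": discovery_schema(full_schema),
        "invocation_schema": full_schema,
    }


# Placeholder full schema; description strings are illustrative only.
FULL = {
    "type": "object",
    "properties": {
        "commands": {"type": "array", "description": "Commands to run.",
                     "items": {"type": "object"}},
        "queries": {"type": "array", "description": "Search queries.",
                    "items": {"type": "string"}},
        "concurrency": {"type": "integer", "description": "Parallelism limit."},
        "timeout": {"type": "integer", "description": "Timeout in milliseconds."},
    },
}

tool = tiered_tool(
    "ctx_batch_execute",
    "Run commands in parallel, auto-index output, return matching sections.",
    FULL,
)
print(json.dumps(tool["discovery_schema"], indent=2))
```

Deriving the discovery tier from the invocation tier, rather than authoring both by hand, keeps the two from drifting apart.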

5.2 Explicit Schema Versioning (Medium-Term)

Add a schema_version field to MCP tool registrations. The host runtime tracks versions and only re-sends schemas to the model when the version changes. This makes the caching contract explicit rather than relying on implicit prompt-prefix matching, and enables delta updates (add/remove a single tool without invalidating the entire prefix).

{
  "name": "ctx_batch_execute",
  "schema_version": "2.1.0",
  "schema": { ... }
}

Estimated saving: Prevents full prefix invalidation on incremental tool changes.
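A host-side sketch of what version tracking could look like, assuming the proposed `schema_version` field exists on each registration. The class and method names are hypothetical, and tool removal is deliberately left out to keep the example short.

```python
class SchemaVersionTracker:
    """Remembers which schema_version of each tool was last sent to the model."""

    def __init__(self):
        self._sent = {}  # tool name -> schema_version last sent

    def pending(self, registrations):
        """Return registrations that are new or changed, and record them as sent."""
        changed = []
        for tool in registrations:
            if self._sent.get(tool["name"]) != tool["schema_version"]:
                changed.append(tool)
                self._sent[tool["name"]] = tool["schema_version"]
        return changed


tracker = SchemaVersionTracker()
registrations = [
    {"name": "ctx_batch_execute", "schema_version": "2.1.0"},
    {"name": "ctx_search", "schema_version": "1.0.0"},
]

first = tracker.pending(registrations)       # both tools are new
registrations[1] = {"name": "ctx_search", "schema_version": "1.1.0"}
second = tracker.pending(registrations)      # only ctx_search changed
third = tracker.pending(registrations)       # nothing changed

print([t["name"] for t in first])
print([t["name"] for t in second])
print([t["name"] for t in third])
```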

5.3 Tool Namespacing (Long-Term)

Allow tools to be grouped under namespaces with shared schema prefixes. Instead of 20 top-level tools, expose 3–5 namespaces with 4–6 tools each. The namespace schema (shared parameter patterns, shared description vocabulary) is sent once; individual tools inherit and override.

Estimated saving: 30–40% of schema tokens via deduplication of repeated patterns.

6. Reproduction

# Measure token cost of a single tool definition
curl -s https://api.anthropic.com/v1/messages/count_tokens \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-5",
    "tools": [<paste full MCP tool JSON schema here>],
    "messages": [{"role": "user", "content": "hi"}]
  }'

# Compare count with vs. without tools to isolate schema overhead

7. Request

The MCP protocol is well-designed for capability discovery, but the cost model assumes schemas are cheap. They are not. Every token of schema is a token of context window — and context window capacity directly bounds model reasoning quality and session length.

Requested additions to the MCP specification:

A "schema efficiency" recommendation section advising a maximum schema size per tool (suggested: 300 tokens for discovery, 1,000 for invocation)
A tiered schema detail mechanism (discovery vs. invocation tiers)
A schema_version field in tool registration for explicit cache invalidation

This would benefit all MCP implementations and hosts, not just context-mode.

Data from production usage of context-mode v1.0.151 over 22 days, 2,600 conversations.

PengSpirit · 2026-05-29T15:03:40Z

The 10× delta being entirely in schema size is the right diagnosis, but one point about the mitigation framing is worth pinning down first: not all of those tokens are equal, and "strip the schema" optimizes the wrong variable. Token cost and tool-selection reliability are in direct tension, and the two failure modes have very different blast radii. A bloated schema costs you context window (recoverable, amortized by prompt caching), but an under-described schema costs you wrong-tool selection and hallucinated parameters (a silent correctness failure that shows up as "the agent went into a debugging loop blaming the harness").

So the question isn't "how do we cut tokens" — it's "which tokens earn their place." Sorting your own measured data by tokens-per-unit-of-disambiguation rather than raw tokens:

High legibility-per-token (keep): anti-purpose clauses ("not for X"), enum constraints on finite-value string params, the one verb+scope sentence. These are cheap and they're exactly what stops the model guessing.
Low legibility-per-token (candidate to trim): restating the type in prose ("a string representing the name"), nested object descriptions that duplicate the field names, example blocks longer than the description itself.

ctx_batch_execute at 1,024 tokens is almost certainly carrying both kinds — the commands: array of {label, command} nesting is where the prose tends to duplicate structure the schema already encodes.

On the protocol-level proposals: a lazy/on-demand schema fetch (send names + one-liners up front, full schema on first call) is the most promising of the three because it preserves legibility at call time while cutting the up-front tax — but it shifts the wrong-tool-selection risk earlier (the model picks from one-liners), so the one-liner quality becomes load-bearing in a way it isn't today.

If it's useful to quantify the "which descriptions earn their tokens" split rather than eyeball it: Anthropic's /v1/messages/count_tokens (which you're already using) gives you the cost side. For the legibility side, npx @incultnitollc/mcp-probe test "<launch command>" scores each tool on a five-axis breakdown (description quality, enum-shape, anti-purpose presence, mutation-legibility, distribution-metadata) — cross-referencing the two tells you which high-token tools are paying for legibility vs paying for prose bloat. Different measurement than Inspector's interactive view; this one's batch over the live launch command so it catches the runtime-config-dependent schemas too.

1 reply

leuasseurfarrelds247-arch Jun 1, 2026 (this comment was marked as spam)

Ben-Home · 2026-05-31T18:07:19Z

Sharing real data from a production hosted MCP server (CorpusIQ, mcp2.corpusiq.io) with 53 tools across 25+ business data sources. This is exactly the overhead problem we run into.

Our measured footprint at 53 tools:

Full schema in context: ~47,000 tokens (~886 tokens/tool average)
Filtered to 5 most relevant tools: ~4,400 tokens
Tool name list only: ~280 tokens

What we're doing in practice:

We expose tool groupings by domain (e-commerce, finance, analytics, CRM) and let the agent select which group to load rather than dumping all 53 schemas upfront. A Shopify-focused agent loads ~8 tools (~7,000 tokens) instead of all 53.

The spec could help here with two additions:

Tool namespacing — group/tool_name conventions so clients can request a subset: list_tools(group="ecommerce")
Schema levels — list_tools(verbosity="minimal") returns name+description only, verbosity="full" returns parameters. Agent decides which level it needs before calling.

Both of these are backward-compatible additions that would meaningfully reduce context overhead for servers with 20+ tools without requiring any behavior changes for simple servers.
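A rough sketch of how the two additions might behave on the server side. The `group` and `verbosity` parameters are the hypothetical ones suggested above (they are not in the current spec), the tool names and descriptions are made up for illustration, and `inputSchema` is used as the schema field.

```python
def list_tools(tools, group=None, verbosity="full"):
    """Filter by "group/tool_name" prefix and optionally drop parameter schemas."""
    if verbosity not in ("minimal", "full"):
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    selected = [
        tool for tool in tools
        if group is None or tool["name"].split("/", 1)[0] == group
    ]
    if verbosity == "minimal":
        return [{"name": t["name"], "description": t["description"]} for t in selected]
    return selected


# Hypothetical catalog; names and descriptions are placeholders.
CATALOG = [
    {"name": "ecommerce/list_orders", "description": "List recent orders.",
     "inputSchema": {"type": "object", "properties": {"limit": {"type": "integer"}}}},
    {"name": "ecommerce/get_product", "description": "Fetch one product by ID.",
     "inputSchema": {"type": "object", "properties": {"id": {"type": "string"}}}},
    {"name": "finance/list_invoices", "description": "List invoices.",
     "inputSchema": {"type": "object", "properties": {"status": {"type": "string"}}}},
]

print(list_tools(CATALOG, group="ecommerce", verbosity="minimal"))
```

Servers that ignore both parameters would keep returning the full list, which is what makes the additions backward-compatible.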

For reference: https://corpusiq.io — any MCP client can test the 53-tool overhead directly via mcp2.corpusiq.io/mcp.

1 reply

gustavo-sec Jun 1, 2026

@Ben-Home in your example, is a single server for all your domains still the best option if you won't expose the entire surface to hosts?

I see that AWS, for example, has different MCP servers for different domains (CLI, documentation, pricing, etc.). This seems like a good practice to me because you can keep the surface relatively small while still allowing clients to know all tools beforehand.

So you'd have https://acme.org/mcp-e-commerce, https://acme.org/mcp-finance, etc. (assuming HTTP).

This doesn't solve anything in the OP; it's just a convention that I've decided to adopt in some servers I built, especially since hosts like Claude Code have a limit of 30 tools loaded since they don't paginate.

At Gated, for any server that exposes more than 30 tools we yield a low-severity (and suppressible) finding, as we believe that clients have the right to decide how many tools to load, and 30 is a good number, especially since some clients already adopt it.

gustavo-sec · 2026-06-01T14:45:58Z

Strong +1 to @chopmob-cloud. Framing the 5–15× as a discipline failure rather than a protocol constraint matches what we see too.

The protocol mitigations (tiered schema, versioning) are worth having, but they paper over per-tool authoring habits that are fixable today at zero protocol cost.

One thing I'd add: the redundancy problem is measurable, which means it's enforceable in CI. Tool-description vs property-description double-counting, prose that restates the type, example blocks longer than the description — these are detectable as a schema lint, not just a code-review judgment call. Treating "schema token budget per tool" as a checked constraint (the way you'd check bundle size) catches the regression before it ships, instead of discovering it after 29 tools have each drifted to 1,000 tokens.

That keeps @PengSpirit's point intact: the goal isn't fewer tokens, it's tokens that earn their place. A lint that flags redundancy trims bloat without touching the anti-purpose clauses and enum constraints that are doing the disambiguation work.
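For illustration, a minimal sketch of such a lint. The 4-characters-per-token estimate is a crude assumption standing in for a real tokenizer, the default budget of 300 borrows the OP's suggested discovery-tier figure, and the function names and example tool are hypothetical.

```python
import json


def estimate_tokens(obj):
    """Crude proxy: about 4 characters of compact JSON per token (an assumption)."""
    return len(json.dumps(obj, separators=(",", ":"))) // 4


def lint_tool(tool, budget=300):
    """Return a list of findings: budget overruns and descriptions that restate the type."""
    findings = []
    cost = estimate_tokens(tool)
    if cost > budget:
        findings.append(f"{tool['name']}: ~{cost} tokens exceeds budget of {budget}")
    tool_description = tool.get("description", "").strip().lower()
    properties = tool.get("inputSchema", {}).get("properties", {})
    for param, spec in properties.items():
        description = spec.get("description", "").strip().lower()
        type_name = spec.get("type")
        if isinstance(type_name, str) and description.startswith(
            (type_name, f"a {type_name}", f"an {type_name}", f"the {type_name}")
        ):
            findings.append(f"{tool['name']}.{param}: description restates the type")
        if description and description == tool_description:
            findings.append(f"{tool['name']}.{param}: description duplicates the tool description")
    return findings


# Hypothetical tool used only to exercise the lint.
TOOL = {
    "name": "demo_lookup",
    "description": "Look up a record by name. Not for bulk export.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "A string representing the name"},
            "kind": {"type": "string", "enum": ["user", "team"],
                     "description": "Record kind to search."},
        },
    },
}

for finding in lint_tool(TOOL):
    print(finding)
```

In CI, a non-empty findings list would fail the build, the same way a bundle-size check does. The anti-purpose clause and the enum in the example pass untouched.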

Disclaimer: I'm building Gated, an audit/scanning platform for MCP servers, so schema quality is squarely in my wheelhouse — flagging that for transparency.

mnifzied-create · 2026-06-06T19:13:34Z

A cross-implementation data point, since the numbers here so far are each from a single server (11 / 29 / 53 tools): I measured 13 real open-source MCP servers + agents (79 tools total) through one tokenizer so they're directly comparable. Full dataset + method is here; the parts relevant to the spec question:

Per-server overhead (schema only, compact JSON):

GitHub MCP — 3,546 tok (26 tools)
GitLab — 1,194 (9) · Git — 1,117 (12) · Slack — 679 (8) · Google Maps — 547 (7) · Fetch — 236 (1)
Median across all servers: 547 tok/turn; average tool 123 tok (median 103).

That corroborates the ~100-tok floor and ~800–1,200-tok heavy tools measured upthread — it just generalizes across the ecosystem rather than one server.

Two findings that I think sharpen @PengSpirit's "which tokens earn their place" and @gustavo-sec's "enforceable in CI":

1. A large slice of the bloat is tokens nobody authored. Serialization alone swings the bill ~20% on the identical tool — Fetch MCP is 236 tok compact vs 288 pretty-printed. Pydantic's model_json_schema() auto-adds a title to every field; zod-to-json-schema appends $schema and additionalProperties. None of that is a description that helps tool-selection — it's whitespace and converter artifacts, re-sent every turn. So the tension partly dissolves: you can cut a real fraction without touching the descriptions that earn their keep. That's precisely what a schema-lint / per-tool token budget can enforce mechanically — flag pretty-printing, auto-title, redundant $schema/additionalProperties, and prose that just restates the type.

2. The single fattest tool in the set is sequentialthinking from the official servers repo — 827 tokens in one tool, almost all of it a ~565-token natural-language description around a 9-field schema. It's larger than the entire toolset of 8 of the 12 multi-tool servers measured. Not a knock on that tool (the prose is doing genuine disambiguation work, per @PengSpirit) — but it's a clean illustration that "one tool" and "~1,000 tokens" can be the same thing, which is the OP's structural point.
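To illustrate finding 1, here is a small sketch that measures pretty-printed versus compact serialization and then strips converter artifacts from the copy shown to the model. The schema is a hypothetical example of generator output, not the real Fetch MCP schema, and sizes are in characters rather than tokens. One caution that is my own assumption: `additionalProperties: false` does affect validation, so this treats the stripped copy as model-facing only and keeps the original for validating calls.

```python
import json

ARTIFACT_KEYS = {"title", "$schema", "additionalProperties"}


def strip_artifacts(node, in_properties=False):
    """Copy a schema, dropping converter artifacts but keeping parameter names intact."""
    if isinstance(node, dict):
        return {
            key: strip_artifacts(value, key == "properties" and not in_properties)
            for key, value in node.items()
            if in_properties or key not in ARTIFACT_KEYS
        }
    if isinstance(node, list):
        return [strip_artifacts(item) for item in node]
    return node


# Hypothetical generator output; field names and text are placeholders.
GENERATED = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "FetchArgs",
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "url": {"title": "Url", "type": "string", "description": "URL to fetch."},
        "max_length": {"title": "Max Length", "type": "integer",
                       "description": "Maximum characters to return."},
    },
    "required": ["url"],
}

pretty = len(json.dumps(GENERATED, indent=2))
compact = len(json.dumps(GENERATED, separators=(",", ":")))
cleaned = len(json.dumps(strip_artifacts(GENERATED), separators=(",", ":")))

print(f"pretty-printed: {pretty} chars")
print(f"compact:        {compact} chars")
print(f"compact, artifacts stripped: {cleaned} chars")
```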

Honesty caveat: my cross-server counts use o200k_base (GPT BPE) as a consistent estimate for Claude — applied identically to all 13, so relative comparisons hold; absolute Claude figures (via count_tokens, as the OP used) differ by a few %. Method + pinned source commits are in the writeup.

Net: tiered / verbosity list_tools is the right protocol-level lever — but the serialization-artifact slice is recoverable today, client-side, at zero spec cost and zero accuracy cost, which seems like the cheapest win to surface first.
