Preamble
Title: Mitigating Token Bloat in MCP: Reducing Schema Redundancy and Optimizing Tool Selection
Authors: Zeze Chang (changzeze@huawei.com), Jinyang Li (lijinyang9@huawei.com), Zhen Cao (zhen.cao@huawei.com)
Status: Proposal
Type: Standards Track
Created: 2025-09-30
Abstract
This SEP proposes a set of optimizations to mitigate token overhead and improve tool selection efficiency in the Model Context Protocol (MCP). As the number of tools and schema complexity grow, redundant and verbose tool definitions significantly increase token consumption and reduce LLM performance. This proposal introduces four complementary mechanisms: (1) schema deduplication through JSON Schema $ref references to eliminate redundant content across tools; (2) adaptive control of optional schema fields in tools/list responses to minimize unnecessary data transmission; (3) flexible response granularity, allowing servers to adjust output verbosity based on client intent; and (4) an embedding-based similarity matching approach for tool retrieval, limiting the number of tool descriptions returned to the LLM. Together, these mechanisms aim to reduce token bloat, enhance computational efficiency, and improve the overall accuracy and scalability of MCP-based AI agent systems.
Motivation
Compared with chatbots, AI agents consume significantly more tokens. Besides multi-round conversations, a dominant cause is tool invocation through MCP. From an end-to-end analysis of the MCP workflow, we attribute the overhead mainly to the following:
- Redundant schemas in tools/list: The schemas of MCP tools contain significant repetition. Tools on the same MCP server are highly likely to share several similar or even identical properties (parameters). These recurring elements are unnecessary and waste tokens.
We analyzed the duplicate content in the schemas of the 60 tools in the official GitHub MCP server (Github-MCP-Server). The results are as follows:
- The "owner" field appears in 36 tool schemas (60%).
"owner": { "description": "Repository owner", "type": "string" },
- The "required" field appears in 9 tool schemas (15%).
"required": [ "owner", "repo" ],
- The "repo" field appears in 39 tool schemas (65%).
"repo": { "description": "Repository name", "type": "string" },
These statistics show a significant amount of schema overlap among different tools within the same server.
- Ambiguous use of optional content: The standard tools/list schema defines a large amount of optional content (such as the output schema) that consumes a significant number of tokens. From the MCP server's perspective, this content is optional. For the LLM, however, anything the server provides is included in the prompt and adds token overhead. The client currently has no way to indicate whether it needs this optional content.
- Excessively long server responses: For some LLM-based servers, the content generated for the host-side LLM may be excessively long. In particular, the output schema that the content must adhere to is sometimes unnecessary. For example, the LLM may need only one or two words describing the weather, yet a complex output schema incurs additional token consumption.
- Too many tools in the list: As a server hosts more tools, the agent's capabilities grow, but tool selection becomes more difficult. Empirical evidence indicates that detailed tool descriptions not only lead to substantial token consumption by the LLM but also markedly reduce the accuracy of tool selection.
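The kind of duplicate analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the GitHub MCP server statistics: the three tools are made up, and the helper name count_shared_properties is hypothetical. It assumes each tool carries its parameters under inputSchema.properties.

```python
import json
from collections import Counter


def count_shared_properties(tools):
    """Count how many tools carry each identical (name, definition) property."""
    counts = Counter()
    for tool in tools:
        props = tool.get("inputSchema", {}).get("properties", {})
        for name, definition in props.items():
            # Serialize with sorted keys so equal definitions compare equal.
            counts[(name, json.dumps(definition, sort_keys=True))] += 1
    return counts


# Made-up tools for illustration only (not taken from a real server).
OWNER = {"type": "string", "description": "Repository owner"}
REPO = {"type": "string", "description": "Repository name"}
tools = [
    {"name": "get_issue",
     "inputSchema": {"type": "object", "properties": {"owner": OWNER, "repo": REPO}}},
    {"name": "list_commits",
     "inputSchema": {"type": "object", "properties": {"owner": OWNER, "repo": REPO}}},
    {"name": "search_code",
     "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}},
]

counts = count_shared_properties(tools)
for (name, _definition), n in counts.most_common():
    print(f"{name}: {n}/{len(tools)} tools ({n / len(tools):.0%})")
```

Keying on both the property name and its serialized definition means two tools only count as sharing a field when the definitions are byte-for-byte equivalent, which is the case where a shared reference could replace them.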
Specification
We propose an individual solution for each of the issues discussed above.
- To address the redundancy of schemas across tools, we use the reference mechanism supported by JSON Schema, namely the $ref keyword. $ref supports two types of references:
- Local definition reference: { "$ref": "#/$defs/user" }. This is an internal reference that uses # to point to definitions within the current document, commonly used for reusing common structures within the same file.
- External file reference: { "$ref": "schema/user.json" }. This points to another JSON file, so schemas in other files can be referenced using either relative or absolute paths.
When the output schema contains duplicate content, it can be handled through internal references:
{ "$defs": { "user": { "type": "object" } }, "properties": { "data": { "$ref": "#/$defs/user" } } }
This requires the LLM to be able to resolve JSON references, which is not a demanding requirement.
In addition, commonly used JSON schemas can be placed on the host side, in which case the server needs to know their address on the host. External file references are used in this case. After the LLM loads the corresponding address, the referenced schema is placed in the context. When the same request is encountered again, the schema is retrieved directly from the context, which avoids processing duplicate content and reduces token consumption. In effect, this turns external file references into internal references.
- When the server responds to a tools/list request, it should be able to decide whether to include optional content based on the request. For example, if the client requests a short tools/list, the server should return only the necessary content and omit optional content (such as the output schema). This mechanism avoids spending LLM tokens on optional fields.
- The server does not always need to adhere to the output schema and return finely structured responses. It should be able to return results at a granularity that matches the client's tool call. In other words, the server should be able to return results that either conform or do not conform to the output schema. For instance, by adding a description such as "detailed" or "short" to the input schema, the server can decide, based on the request, whether to return a detailed result that follows the output schema or a simple, unstructured result.
- We also aim to introduce an embedding similarity matching method for tool selection that returns a limited number of tool descriptions to alleviate token bloat. Specifically, the LLM on the client side generates intents based on the user's request, and the MCP server creates embeddings from these intents. The MCP server also generates embeddings for the tool list. When embedding a tool, the server can assign weights to the descriptions of its different components and take a weighted average of their embeddings, which improves the accuracy of tool selection. The MCP server then performs similarity matching and returns the top-k tools with the highest similarity scores from the tools/list to the LLM. This narrows the set of tools the LLM chooses from, thereby reducing its token overhead and improving the accuracy of tool selection.
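The embedding-based selection in the last item can be sketched as follows. This is only an illustration under stated assumptions: embed() is a toy bag-of-words stand-in for a real embedding model, the component weights (0.3 for the name, 0.7 for the description) are arbitrary, and the tools and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().replace("_", " ").split())


def embed_tool(tool, name_weight=0.3, desc_weight=0.7):
    """Weighted average of the embeddings of a tool's components."""
    vector = Counter()
    for part, weight in ((tool["name"], name_weight),
                         (tool["description"], desc_weight)):
        for token, value in embed(part).items():
            vector[token] += weight * value
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_tools(intent, tools, k):
    """Return the k tools whose embeddings are most similar to the intent."""
    query = embed(intent)
    ranked = sorted(tools, key=lambda t: cosine(query, embed_tool(t)), reverse=True)
    return ranked[:k]


# Made-up tool list for illustration.
tools = [
    {"name": "create_issue", "description": "Create a new issue in a GitHub repository"},
    {"name": "get_weather", "description": "Get the current weather forecast for a city"},
    {"name": "list_commits", "description": "List commits of a branch in a repository"},
]

selected = top_k_tools("weather forecast for Paris", tools, k=1)
print([tool["name"] for tool in selected])
```

In a real deployment the tool embeddings would presumably be computed once and cached, so only the intent needs to be embedded per request.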
Rationale
The rationale for this proposal is rooted in the increasing complexity of tool ecosystems within the Model Context Protocol (MCP) framework. As MCP servers continue to aggregate a growing number of tools, the cumulative schema definitions and detailed tool descriptions introduce substantial token overhead for large language models (LLMs). This problem not only affects computational efficiency and latency but also undermines the overall reliability and responsiveness of agent-based systems.
The proposed schema deduplication mechanism based on JSON $ref references is designed to address redundancy at its source. By allowing schema elements to be reused rather than duplicated, MCP servers can maintain structural consistency while significantly reducing the amount of repeated content transmitted to the LLM. This approach leverages well-established JSON Schema practices, ensuring interoperability and minimizing the need for protocol-level changes.
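As a rough sketch of this reuse, the following Python hoists property definitions that recur across tools into one shared $defs table and replaces each occurrence with a $ref, then shows that the references can be inlined again. The shared top-level $defs table, the helper names deduplicate and resolve, and the sample tools are all assumptions made for illustration; the proposal does not fix where shared definitions live.

```python
import copy
import json
from collections import Counter


def deduplicate(tools, min_count=2):
    """Hoist property definitions shared by >= min_count tools into $defs."""
    counts = Counter(
        (name, json.dumps(definition, sort_keys=True))
        for tool in tools
        for name, definition in tool["inputSchema"].get("properties", {}).items()
    )
    defs = {}
    for (name, encoded), n in counts.most_common():
        # Keep the most frequent definition per name; variants stay inline.
        if n >= min_count and name not in defs:
            defs[name] = json.loads(encoded)
    slim = copy.deepcopy(tools)
    for tool in slim:
        props = tool["inputSchema"].get("properties", {})
        for name, definition in list(props.items()):
            if defs.get(name) == definition:
                props[name] = {"$ref": f"#/$defs/{name}"}
    return defs, slim


def resolve(defs, slim):
    """Inline every '#/$defs/<name>' reference, restoring the full schemas."""
    prefix = "#/$defs/"
    full = copy.deepcopy(slim)
    for tool in full:
        props = tool["inputSchema"].get("properties", {})
        for name, definition in list(props.items()):
            ref = definition.get("$ref", "")
            if ref.startswith(prefix):
                props[name] = copy.deepcopy(defs[ref[len(prefix):]])
    return full


# Made-up tools for illustration.
OWNER = {"type": "string", "description": "Repository owner"}
tools = [
    {"name": "get_issue",
     "inputSchema": {"type": "object", "properties": {"owner": OWNER}}},
    {"name": "list_commits",
     "inputSchema": {"type": "object", "properties": {"owner": OWNER}}},
    {"name": "search_code",
     "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}},
]

defs, slim = deduplicate(tools)
print(json.dumps({"$defs": defs, "tools": slim}, indent=2))
```

A reference is not free: each { "$ref": ... } still costs tokens, so hoisting is likely to pay off only for definitions that are long or repeated often, which is why the sketch exposes min_count.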
Adaptive control of optional schema fields offers a practical balance between flexibility and efficiency. While MCP’s extensible schema design was intended to accommodate diverse use cases, in practice, the unconditional inclusion of optional fields (e.g., detailed output schemas) results in unnecessary token consumption. Allowing servers to tailor responses according to the client’s request granularity ensures that only relevant information is transmitted, reducing overhead without sacrificing functionality.
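One way a server might honor such a request is sketched below. The "detail" request parameter and its "short" value are hypothetical names chosen for this example, as is the choice of which fields count as necessary; the proposal does not define them.

```python
# Fields kept in a short listing (an assumption for this sketch).
SHORT_FIELDS = ("name", "description", "inputSchema")


def list_tools(tools, params=None):
    """Build a tools/list result, dropping optional fields for short requests."""
    detail = (params or {}).get("detail", "full")  # hypothetical parameter
    if detail == "short":
        return {
            "tools": [
                {key: tool[key] for key in SHORT_FIELDS if key in tool}
                for tool in tools
            ]
        }
    return {"tools": tools}


# Made-up tool definition for illustration.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "inputSchema": {"type": "object",
                        "properties": {"city": {"type": "string"}}},
        "outputSchema": {"type": "object",
                         "properties": {"condition": {"type": "string"},
                                        "temperature_c": {"type": "number"}}},
    }
]

short = list_tools(tools, {"detail": "short"})
full = list_tools(tools)
print(sorted(short["tools"][0]))
print(sorted(full["tools"][0]))
```

Defaulting to the full listing when the parameter is absent keeps the behavior unchanged for clients that do not know about it.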
The introduction of response granularity levels further enhances protocol efficiency by enabling LLM-based servers to align the verbosity of their responses with the semantic intent of the client. This adaptive behavior ensures that lightweight requests can be served with concise responses while preserving detailed output structures for complex tasks that require them.
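A minimal sketch of a tool handler with two granularity levels follows. The "verbosity" argument, its values, the handler name, and the weather data are all made up for illustration; the short branch deliberately returns plain text without a structured result, which is the relaxation this proposal describes.

```python
import json


def call_get_weather(arguments):
    """Return a weather result whose verbosity follows the caller's request."""
    # Made-up data standing in for a real weather lookup.
    report = {
        "city": arguments["city"],
        "condition": "sunny",
        "temperature_c": 21,
        "humidity_pct": 40,
    }
    # "verbosity" is a hypothetical input-schema property.
    if arguments.get("verbosity", "detailed") == "short":
        # One or two words, no structured payload.
        return {"content": [{"type": "text", "text": report["condition"]}]}
    # Detailed result that follows the tool's output schema.
    return {
        "content": [{"type": "text", "text": json.dumps(report)}],
        "structuredContent": report,
    }


short = call_get_weather({"city": "Paris", "verbosity": "short"})
detailed = call_get_weather({"city": "Paris"})
print(short["content"][0]["text"])
print(detailed["structuredContent"])
```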
Finally, the embedding-based similarity matching mechanism addresses the challenge of tool selection scalability. As the number of available tools increases, exhaustive tool enumeration becomes infeasible for LLMs due to both token limitations and cognitive load. By incorporating embedding similarity ranking, servers can pre-filter the tool set and present only the most relevant candidates to the LLM, improving both selection accuracy and inference efficiency.
Collectively, these design choices align with MCP’s overarching goals of modularity, scalability, and LLM efficiency. The proposed mechanisms do not alter the fundamental communication semantics of MCP but instead optimize its practical deployment, making it more sustainable for large-scale, real-world agent systems.
Backward Compatibility
This SEP introduces no backward incompatibilities.
Future Work
In future work, we will analyze how the mechanisms in this proposal affect agent performance, in particular how $ref references affect the ability of LLMs to parse tool schemas. Intuitively, LLMs can parse content referenced by $ref and make reasonable tool selections based on these descriptions, but this requires further experimental verification.