
Conversation

@evalstate
Member

@evalstate evalstate commented Mar 13, 2025

Tool Call Results allow the return of an array of Text, Image and EmbeddedResources. This is typically consistent with Messaging APIs (e.g. OpenAI, Anthropic) which allow separation of content blocks within a "User" or "Assistant" message.

The current API treats Prompt and Sampling messages as singular - e.g. they can only contain one content block. This means that client code for message handling needs to "special case" building multi-part messages by recognizing and concatenating them. This also potentially loses the semantics of the "Message" container.

Motivation and Context

  1. Consistency across schema: Currently CallToolResultSchema uses an array of content items, while PromptMessageSchema and SamplingMessageSchema use a single content item. This inconsistency creates implementation complexity.

  2. Alignment with LLM provider APIs: Modern LLM APIs like OpenAI's Chat Completions and Anthropic's Messages API support multiple content blocks per message:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
        ],
    }],
)

  3. Improved expressiveness: Allows for natural combinations like:
  • Text with supporting images in the same message
  • Text with embedded code snippets as separate blocks
  • Multiple resource references within a logical message unit
  4. Simplified client implementations: Eliminates the need for clients to split/join content across multiple messages to represent what is logically a single message with multiple parts.

How Has This Been Tested?

Breaking Changes

This breaking change can be mitigated with a Protocol Version check to convert from a single element to an Array.
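As a rough illustration, the receiving side of such a version check could normalize either wire shape into the array form. Type and function names below are assumptions made for the sketch, not part of the schema:

```typescript
// Illustrative sketch only: `ContentBlock`, `WireMessage`, and
// `normalizeContent` are assumed names, not spec or SDK identifiers.
type ContentBlock = { type: string; [key: string]: unknown };

interface WireMessage {
  role: "user" | "assistant";
  // Older peers send a single block; newer peers send an array.
  content: ContentBlock | ContentBlock[];
}

// Normalize both shapes so downstream code always sees an array.
function normalizeContent(msg: WireMessage): ContentBlock[] {
  return Array.isArray(msg.content) ? msg.content : [msg.content];
}
```

With this in place, only the serialization path needs to branch on the negotiated protocol version.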

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

The User Guide will need updating on publication.

@PederHP
Member

PederHP commented Mar 14, 2025

An alternative to the breaking change could be to use a new name for the field or to add the array of content as a new type of content. Not saying either is better than a breaking change. Just worth considering, as in practice many clients/servers will likely not support multiple protocol versions, which means that non-backwards compatible schema changes will break compatibility. Maybe that's ok, but thought I mention this anyway.

@evalstate
Member Author

I did think on this one quite hard, but I think the mitigating factors are:

  • Relatively low take-up of the Prompts/Sampling features reduces the risk. Those using the features will likely be in a position to adapt. It would be nice to know if others had similar feedback.
  • Conversion for the general case is quite straightforward
  • A new name/field would introduce duplication and tech debt - it might make sense as a migration path, but internally I'm now coding to the assumption that Messages have multiple content blocks.

@PederHP
Member

PederHP commented Mar 14, 2025

I did think on this one quite hard, but I think the mitigating factors are:

  • Relatively low take-up of the Prompts/Sampling features reduces the risk. Those using the features will likely be in a position to adapt. It would be nice to know if others had similar feedback.
  • Conversion for the general case is quite straightforward
  • A new name/field would introduce duplication and tech debt - it might make sense as a migration path, but internally I'm now coding to the assumption that Messages have multiple content blocks.

I agree, but I think it makes sense to have articulated and considered the alternatives.

@evalstate
Member Author

Well, it's put here as a draft to provoke the conversation - and get input from the Maintainers. I'm happy to put the work into a solution of any type (compatibility preserving etc.) if we agree this is something worth doing - but there will be a lot of documentation etc. to write if we proceed with any option. Thank you.

@dsp-ant
Member

dsp-ant commented Mar 20, 2025

Curious what @jspahrsummers and @jerome3o-anthropic have to say, but I think this approach makes sense. It'll be a bit painful for clients to update, but I think that's probably okay. Luckily the protocol is versioned and so we can deal with different result types.

@evalstate
Member Author

On this one, I am planning on writing a discussion thread showing examples of this, and potential workarounds with sample code.

@jspahrsummers
Member

Yep, no objections from me.

cliffhall previously approved these changes Apr 2, 2025
Member

@cliffhall cliffhall left a comment


LGTM! 👍

@cliffhall
Member

cliffhall commented Apr 10, 2025

Well, it's put here as a draft to provoke the conversation - and get input from the Maintainers.

Here's a possible alternative: The content field could be one of the types OR an array of them.

"content": {
    "anyOf": [
        {
            "$ref": "#/definitions/TextContent"
            ...
        },
        {
            "type": "array",
            "items": {
                "anyOf": [
                    {
                        "$ref": "#/definitions/TextContent"
                        ...
                    }
                ]
            }
        }
    ]
}

Probably not the right solution, but I thought I'd throw it out there.

Makes consuming the content more complex since you have to account for the either/or. And devs who are already using sampling would still need to update their code.

Realizing it's essentially @PederHP's suggestion:

add the array of content as a new type of content.

@theobjectivedad

theobjectivedad commented May 11, 2025

I appreciate this discussion and just wanted to weigh in regarding use cases. I can think of two scenarios where this would be useful. As previously mentioned, (a) sampling responses that ask for n>1 are certainly valid. I also wanted to add a (perhaps?) more common scenario where (b) MCP servers need to run multiple sampling requests in parallel.

I ran into this yesterday when I wanted to run parallel summary requests at the MCP server level on a list of search results. For this specific situation, I can certainly summarize at the client level; however, I feel strongly that enough new scenarios will continue to arise for (a) and (b) over time to justify a protocol change.

@evalstate evalstate marked this pull request as ready for review May 25, 2025 20:25

@ktwillcode ktwillcode left a comment


👍

Member

@cliffhall cliffhall left a comment


Just a few comments.

Member

We shouldn't be modifying this past schema version.

Member

We shouldn't be modifying this past schema version.

export interface SamplingMessage {
  role: Role;
- content: TextContent | ImageContent | AudioContent;
+ content: (TextContent | ImageContent | AudioContent)[];
Member

@cliffhall cliffhall Jun 2, 2025


One last-ditch ask for a backward compatible way to handle this. Defining an ArrayContent type which can contain any of the existing types, and then this could become:

Suggested change:
- content: (TextContent | ImageContent | AudioContent)[];
+ content: (TextContent | ImageContent | AudioContent | ArrayContent);

Not certain if there's a reason why it wouldn't work, but thought I'd put it out there.

Member Author

I understand the concern, however the original proposal is still my preference. My reasoning is:

  • Changing content to an Array makes it the same as content in CallToolResult.
  • SDK compatibility for both Client and Server is quite straightforward. E.g. converting from content to [content] or from [content,content] to [Message,Message]. This should mean the rollout at both the SDK and protocol level can be managed. There is example code in fast-agent that does this conversion (as it uses this type internally).
  • It is more directly expressive. For example mcp-webcam development version has image prompts for ICL. These have to be [Message TextContent],[Message ImageContent] rather than the actual LLM API Shape of Message [TextContent,ImageContent].
  • Adoption of Sampling and Prompts containing embedded content is still relatively small in comparison to the broader MCP system, so this will be nowhere near as impactful as a breaking change on CallToolResult.

So the trade-off is between a new type introduced for backwards compatibility, or expressing the Message content semantically. I think because it can be mitigated at the SDK with low effort I fall towards the latter.
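The SDK-level conversions described above are small transforms; a sketch under assumed type names (illustrative only, not actual SDK code):

```typescript
// Illustrative types: `MultiMessage` / `SingleMessage` are assumed names
// for the new (array content) and old (single content) message shapes.
type ContentBlock = { type: string; [key: string]: unknown };

interface MultiMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

interface SingleMessage {
  role: "user" | "assistant";
  content: ContentBlock;
}

// Upgrade: content -> [content]
function toMulti(msg: SingleMessage): MultiMessage {
  return { role: msg.role, content: [msg.content] };
}

// Downgrade: [content, content] -> [Message, Message]
function toSingles(msg: MultiMessage): SingleMessage[] {
  return msg.content.map((block) => ({ role: msg.role, content: block }));
}
```

The downgrade direction is lossy only in the sense that the grouping of blocks into one logical message is flattened, which is the status quo today.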

@evalstate
Member Author

Yes, I understand it is a breaking change, and was proposed as such. Given the changes for StructuredOutput for the next protocol revision I'm not sure that this is worse (as there is a non-breaking SDK interface path to introduce it).

@evalstate evalstate requested review from dsp-ant and pcarleton June 4, 2025 08:30
@evalstate evalstate mentioned this pull request Jun 4, 2025
@dsp-ant dsp-ant moved this from Draft to Consulting in Standards Track Jun 6, 2025
@dsp-ant
Member

dsp-ant commented Jun 6, 2025

Claude suggested that several documentation files in docs/ need updates to reflect the breaking changes in this PR:

Files needing updates:

  • docs/docs/concepts/sampling.mdx - Message format examples show content as single object instead of array
  • docs/tutorials/building-a-client-node.mdx - Client tutorial examples use old single content format
  • docs/sdk/java/mcp-server.mdx - Java SDK sampling examples need to use content arrays

Why: Since this changes message content from content: {...} to content: [{...}], the documentation examples will mislead developers and cause implementation errors.

Member

@dsp-ant dsp-ant left a comment


I think I am okay with this change.

Please update the documentation and changelog. Please run this past SDK maintainers ASAP to understand any concerns before we land the revision.

Ping me when you need final approval.


## Other schema changes

- PromptMessage and SamplingMessage now contain Arrays of content.
Member

This should be a major change with a note that it's breaking

@github-project-automation github-project-automation bot moved this from Consulting to In Review in Standards Track Jun 6, 2025
@dsp-ant dsp-ant added this to the DRAFT 2025-06-XX milestone Jun 6, 2025
@dsp-ant
Member

dsp-ant commented Jun 10, 2025

Okay, coming back to this. While we are all happy with the change, on the SDK side this is a true test of how we handle version negotiation, and it revealed that we need much more work and coordination on this. While it is quite annoying for everyone involved, I believe it's best if we do not include it in this revision and give SDK developers a chance to figure out how best to deal with different versions of an interface in their SDKs.

@ochafik
Contributor

ochafik commented Sep 26, 2025

I'm slightly worried about allowing message content arrays without requiring strict message role alternation.
And very worried about the breaking change.

Most inference APIs (OpenAI's chat completions, Claude's, but also OSS in HF transformers and llama.cpp) require or assume strict assistant/user alternation in messages, with message content being a single string or an array of typed parts.

The current sampling API amounts to a flattened version of this and allows consecutive repeated roles, but is currently trivial and unambiguous to unflatten, by just grouping by role:

// Sampling messages

[
  {"role": "user", "content": {"type": "text", "text": "Describe and enhance this pic:"}},
  {"role": "user", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "assistant", "content": {"type": "text", "text": "It's dull. I've spiced it up"}},
  {"role": "assistant", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "user", "content": {"type": "text", "text": "And then?"}}
]

Converted to OpenAI / HF-style format (content: string | ({type: "text", text: string} | ...)[]):

// OpenAI- / HF-style messages

[
  {"role": "user", "content": [
    {"type": "text", "text": "Describe and enhance this pic:"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "assistant", "content": [
    {"type": "text", "text": "It's dull. I've spiced it up"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "user", "content": {"type": "text", "text": "And then?"}}
]

Now if we allow this:

[
  {"role": "user", "content": [{"type": "text", "text": "content1.1"}, {"type": "text", "text": "content1.2"}]},
  {"role": "user", "content": [{"type": "text", "text": "content2"}]}
]

The only way to implement it with actual inference APIs will be to coalesce these, losing the implied semantic grouping of content1.1 and content1.2:

[
  {"role": "user", "content": [
    {"type": "text", "text": "content1.1"},
    {"type": "text", "text": "content1.2"},
    {"type": "text", "text": "content2"}
  ]}
]

My take is we should:

  • Have content accept a single MessageContent or an array of it, to avoid backwards-incompatibility:

    type MessageContent = TextContent | ImageContent | AudioContent | EmbeddedResource;
    export interface PromptMessage {
      role: Role;
      content: MessageContent | MessageContent[];
    }
  • Introduce a backward-compatible message role alternation constraint: maybe something like:

    Consecutive sub-sequences of messages with the same role MUST either all have a content with a single MessageContent, or be of length 1.
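The proposed constraint is also cheap to validate mechanically; a sketch with illustrative type and function names:

```typescript
// Illustrative sketch of the proposed rule: within any run of
// consecutive messages sharing a role, either every message has
// singular (non-array) content, or the run has length 1.
type MessageContent = { type: string; [key: string]: unknown };

interface PromptMessage {
  role: "user" | "assistant";
  content: MessageContent | MessageContent[];
}

function satisfiesAlternationRule(messages: PromptMessage[]): boolean {
  let i = 0;
  while (i < messages.length) {
    // Find the end of the current same-role run.
    let j = i;
    while (j < messages.length && messages[j].role === messages[i].role) j++;
    const run = messages.slice(i, j);
    // A run longer than 1 must consist entirely of singular content.
    if (run.length > 1 && run.some((m) => Array.isArray(m.content))) {
      return false;
    }
    i = j;
  }
  return true;
}
```

Under this rule the legacy flattened form stays valid, while mixing array content with repeated roles (the ambiguous case above) is rejected.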

@PederHP
Member

PederHP commented Sep 26, 2025

Most inference APIs (OpenAI's chat completions, Claude's, but also OSS in HF transformers and llama.cpp) require or assume strict assistant/user alternation in messages, with message content being a single string or an array of typed parts.

This is no longer the case. OpenAI and Claude both allow arbitrary ordering, and I think Gemini does too.

If a client has a need for strict turn ordering it can insert dummy messages or merge consecutive user/assistant messages. This is a relatively trivial change to make in those few hosts that need it (probably only open-source inference), and it avoids a lot of complexity in the protocol.
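The merge described here is a small host-side transform; a sketch with illustrative types, assuming array-valued content:

```typescript
// Illustrative sketch: coalesce runs of consecutive messages that share
// a role into a single message whose content concatenates their blocks.
type ContentBlock = { type: string; [key: string]: unknown };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

function mergeConsecutive(messages: Message[]): Message[] {
  const out: Message[] = [];
  for (const msg of messages) {
    const last = out[out.length - 1];
    if (last && last.role === msg.role) {
      // Same role as the previous message: append its blocks.
      last.content.push(...msg.content);
    } else {
      // New role: start a fresh message (copy to avoid mutating input).
      out.push({ role: msg.role, content: [...msg.content] });
    }
  }
  return out;
}
```

This is the flattening trade-off noted earlier in the thread: per-message block grouping is lost, but the result fits APIs that expect strict alternation.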

@evalstate
Member Author

This is no longer the case. OpenAI and Claude both allow arbitrary ordering, and I think Gemini does too.

Came here to say the same (I explicitly test fast-agent's handling of that case as well). I previously had to divide document blocks out for Anthropic, but that's been fixed too: https://github.com/anthropics/anthropic-sdk-python/releases/tag/v0.66.0 :)

I agree with @PederHP that concerns about presentation to the inference API should be handled by the Host, and that the intermediate MCP format should allow maximum expressiveness.

@evalstate
Member Author

#1577 partially solves this, new SEP to be considered for PromptMessage types.

@evalstate evalstate closed this Nov 20, 2025
