Allow Prompt/Sampling Messages to contain multiple content blocks. #198
|
An alternative to the breaking change could be to use a new name for the field or to add the array of content as a new type of content. I'm not saying either is better than a breaking change; it's just worth considering, as in practice many clients/servers will likely not support multiple protocol versions, which means that non-backwards-compatible schema changes will break compatibility. Maybe that's OK, but I thought I'd mention it anyway. |
|
I did think about this one quite hard, but I think the mitigating factors are:
|
I agree, but I think it makes sense to have articulated and considered the alternatives. |
|
Well, it's put here as a draft to provoke the conversation - and get input from the Maintainers. I'm happy to put the work in to a solution of any type (compatibility preserving etc.) if we agree this is something worth doing - but there will be a lot of documentation etc. to write if we proceed with any option. Thank you. |
|
Curious what @jspahrsummers and @jerome3o-anthropic have to say, but I think this approach makes sense. It'll be a bit painful for clients to update, but I think that's probably okay. Luckily the protocol is versioned and so we can deal with different result types. |
|
On this one, I am planning on writing a discussion thread showing examples of this, and potential workarounds with sample code. |
|
Yep, no objections from me. |
cliffhall
left a comment
LGTM! 👍
Here's a possible alternative: The content field could be one of the types OR an array of them.
```json
"content": {
  "anyOf": [
    {
      "$ref": "#/definitions/TextContent"
      ...
    },
    {
      "type": "array",
      "items": {
        "anyOf": [
          {
            "$ref": "#/definitions/TextContent"
            ...
          }
        ]
      }
    }
  ]
}
```
Probably not the right solution, but I thought I'd throw it out there. It makes consuming the content more complex, since you have to account for the either/or, and devs who are already using sampling would still need to update their code. Realizing it's essentially @PederHP's suggestion:
|
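To make the cost of the either/or shape concrete, here is a minimal sketch of how a consumer might normalize it. The two content interfaces are simplified stand-ins for the real schema types, and `ContentOrArray` and `toContentArray` are hypothetical names that appear in neither the spec nor any SDK.

```typescript
// Simplified stand-ins for the schema's content types (assumption: the
// real definitions carry more fields, e.g. annotations).
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
type ContentBlock = TextContent | ImageContent;

// The either/or shape from the anyOf alternative: one block OR an array.
type ContentOrArray = ContentBlock | ContentBlock[];

// Hypothetical helper: normalize to an array once, at the boundary, so the
// rest of the code only ever handles a single shape.
function toContentArray(content: ContentOrArray): ContentBlock[] {
  return Array.isArray(content) ? content : [content];
}

const single = toContentArray({ type: "text", text: "hello" });
const many = toContentArray([
  { type: "text", text: "caption" },
  { type: "image", data: "base64...", mimeType: "image/png" },
]);
console.log(single.length, many.length);
```

Every reader of `content` would need a step like this, which is the added complexity the comment refers to.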
|
I appreciate this discussion and just wanted to weigh in regarding use cases. I can think of two scenarios where this would be useful. As previously mentioned, (a) when sampling responses are asking for I ran into this yesterday when I wanted to run parallel summary requests at the MCP server level on a list of search results. For this specific situation, I can certainly summarize at the client level; however, I feel strongly that enough new scenarios will continue to arise for (a) and (b) over time to justify a protocol change. |
ktwillcode
left a comment
👍
cliffhall
left a comment
Just a few comments.
We shouldn't be modifying this past schema version.
We shouldn't be modifying this past schema version.
```diff
 export interface SamplingMessage {
   role: Role;
-  content: TextContent | ImageContent | AudioContent;
+  content: (TextContent | ImageContent | AudioContent)[];
```
One last-ditch ask for a backward compatible way to handle this. Defining an ArrayContent type which can contain any of the existing types, and then this could become:
```diff
-  content: (TextContent | ImageContent | AudioContent)[];
+  content: (TextContent | ImageContent | AudioContent | ArrayContent);
```
Not certain if there's a reason why it wouldn't work, but thought I'd put it out there.
I understand the concern, however the original proposal is still my preference. My reasoning is:
- Changing content to an Array makes it the same as content in `CallToolResult`.
- SDK compatibility for both Client and Server is quite straightforward, e.g. converting from `content` to `[content]` or from `[content, content]` to `[Message, Message]`. This should mean the rollout at both the SDK and protocol level can be managed. There is example code in `fast-agent` that does this conversion (as it uses this type internally).
- It is more directly expressive. For example, the `mcp-webcam` development version has image prompts for ICL. These have to be `[Message TextContent], [Message ImageContent]` rather than the actual LLM API shape of `Message [TextContent, ImageContent]`.
- Adoption of Sampling and Prompts containing embedded content is still relatively small in comparison to the broader MCP system, so this will be nowhere near as impactful as a breaking change on `CallToolResult`.

So the trade-off is between a new type introduced for backwards compatibility, or expressing the Message content semantically. I think because it can be mitigated at the SDK with low effort I fall towards the latter.
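The two conversions mentioned above (`content` to `[content]`, and `[content, content]` to `[Message, Message]`) could look roughly like the sketch below. This is not the `fast-agent` code. The interface names `SingleContentMessage` and `ArrayContentMessage` and the helpers `wrapMessage` and `splitMessage` are hypothetical, and the content types are simplified.

```typescript
type Role = "user" | "assistant";
// Simplified stand-ins for the schema's content types.
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
type ContentBlock = TextContent | ImageContent;

// Hypothetical names for the current and the proposed message shapes.
interface SingleContentMessage { role: Role; content: ContentBlock }
interface ArrayContentMessage { role: Role; content: ContentBlock[] }

// content -> [content]: lift a current-style message into the array shape.
function wrapMessage(msg: SingleContentMessage): ArrayContentMessage {
  return { role: msg.role, content: [msg.content] };
}

// [content, content] -> [Message, Message]: emit one current-style message
// per block, all carrying the original role.
function splitMessage(msg: ArrayContentMessage): SingleContentMessage[] {
  return msg.content.map((block) => ({ role: msg.role, content: block }));
}

const iclPrompt: ArrayContentMessage = {
  role: "user",
  content: [
    { type: "text", text: "What is in this picture?" },
    { type: "image", data: "base64...", mimeType: "image/png" },
  ],
};
const legacyMessages = splitMessage(iclPrompt);
const upgradedMessages = legacyMessages.map(wrapMessage);
console.log(legacyMessages.length, upgradedMessages.length);
```

Note that the round trip is not lossless: splitting and re-wrapping yields two one-block messages, so the original grouping inside a single message is gone unless the receiver regroups by role.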
|
Yes, I understand it is a breaking change, and was proposed as such. Given the changes for StructuredOutput for the next protocol revision I'm not sure that this is worse (as there is a non-breaking SDK interface path to introduce it). |
…state/specification into feat/message-content-arrays
|
Claude suggested that several documentation files in docs/ need updates to reflect the breaking changes in this PR: Files needing updates:
Why: Since this changes message content from |
dsp-ant
left a comment
I think I am okay with this change.
Please update the documentation and changelog. Please run this past SDK maintainers ASAP to understand any concerns before we land the revision.
Ping me when you need final approval.
```markdown
## Other schema changes

- PromptMessage and SamplingMessage now contain Arrays of content.
```
This should be a major change with a note that it's breaking
|
Okay, coming back to this. While we are all happy with the change, on the SDK side this is a true test of how we handle version negotiation, and it revealed that we need much more work and coordination there. While it is quite annoying for everyone involved, I believe it's best if we do not include it in this revision and give SDK developers a chance to figure out how best to deal with different versions of an interface in their SDKs. |
|
I'm slightly worried about allowing message content arrays without requiring strict message role alternation. Most inference APIs (OpenAI's chat completions, Claude's, but also OSS in HF transformers and llama.cpp) require or assume a strict role alternation.

The current sampling API amounts to a flattened version of this and allows consecutive repeated roles, but it is currently trivial and unambiguous to unflatten, by just grouping by role:
```json
// Sampling messages
[
  {"role": "user", "content": {"type": "text", "text": "Describe and enhance this pic:"}},
  {"role": "user", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "assistant", "content": {"type": "text", "text": "It's dull. I've spiced it up"}},
  {"role": "assistant", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "user", "content": {"type": "text", "text": "And then?"}}
]
```
Converted to OpenAI / HF-style format:
```json
// OpenAI- / HF-style messages
[
  {"role": "user", "content": [
    {"type": "text", "text": "Describe and enhance this pic:"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "assistant", "content": [
    {"type": "text", "text": "It's dull. I've spiced it up"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "user", "content": {"type": "text", "text": "And then?"}}
]
```
Now if we allow this:
```json
[
  {"role": "user", "content": [{"type": "text", "text": "content1.1"}, {"type": "text", "text": "content1.2"}]},
  {"role": "user", "content": [{"type": "text", "text": "content2"}]}
]
```
The only way to implement it with actual inference APIs will be to coalesce these, losing the kinda-implied semantic grouping of the original messages:
```json
[
  {"role": "user", "content": [
    {"type": "text", "text": "content1.1"},
    {"type": "text", "text": "content1.2"},
    {"type": "text", "text": "content2"}
  ]}
]
```
My take is we should:
|
This is no longer the case. OpenAI and Claude both allow arbitrary ordering, and I think Gemini does too. If a client needs strict turn ordering, it can insert dummy messages or merge consecutive user / assistant messages. This is a relatively trivial change to make in the few hosts that need it (probably only open source inference), and it avoids a lot of complexity in the protocol. |
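As an illustration of the "merge consecutive user / assistant messages" option, a host that needs strict turn alternation might do something like the following. This is a sketch only: `mergeConsecutive` is a hypothetical helper, the message type assumes the proposed array-valued `content`, and only text blocks are modelled.

```typescript
type Role = "user" | "assistant";
// Only text blocks are modelled here to keep the sketch short.
interface TextContent { type: "text"; text: string }
interface Message { role: Role; content: TextContent[] }

// Hypothetical host-side helper: fold runs of same-role messages into one
// message so the result strictly alternates between roles.
function mergeConsecutive(messages: Message[]): Message[] {
  const merged: Message[] = [];
  for (const msg of messages) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.role === msg.role) {
      // Same role as the previous message: append its blocks in order.
      last.content.push(...msg.content);
    } else {
      // New turn: copy the block list so the input is never mutated.
      merged.push({ role: msg.role, content: [...msg.content] });
    }
  }
  return merged;
}

const input: Message[] = [
  { role: "user", content: [{ type: "text", text: "content1.1" }, { type: "text", text: "content1.2" }] },
  { role: "user", content: [{ type: "text", text: "content2" }] },
  { role: "assistant", content: [{ type: "text", text: "reply" }] },
];
const alternating = mergeConsecutive(input);
console.log(alternating.map((m) => `${m.role}:${m.content.length}`).join(" "));
```

The merge discards the boundary between the first two user messages, which is the loss of grouping raised in the earlier comment; hosts whose inference API accepts repeated roles can skip this step entirely.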
Came here to say the same (I explicitly test fast-agent's handling of that case as well). I previously had to split document blocks out for Anthropic, but that's been fixed too: https://github.com/anthropics/anthropic-sdk-python/releases/tag/v0.66.0 :) I agree with @PederHP that concerns about presentation to the inference API should be handled by the Host, and that the intermediate MCP format should allow maximum expressiveness. |
|
#1577 partially solves this, new SEP to be considered for PromptMessage types. |
Tool Call Results allow the return of an array of Text, Image and EmbeddedResources. This is typically consistent with Messaging APIs (e.g. OpenAI, Anthropic) which allow separation of content blocks within a "User" or "Assistant" message.
The current API treats Prompt and Sampling messages as singular, i.e. they can only contain one content block. This means that client code for message handling needs to "special case" building multi-part messages by recognizing and concatenating them. This also potentially loses the semantics of the "Message" container.
Motivation and Context
Consistency across schema: Currently CallToolResultSchema uses an array of content items, while PromptMessageSchema and SamplingMessageSchema use a single content item. This inconsistency creates implementation complexity.
Alignment with LLM provider APIs: Modern LLM APIs like OpenAI's Chat Completions and Anthropic's Messages API support multiple content blocks per message:
How Has This Been Tested?
Breaking Changes
This breaking change can be mitigated with a Protocol Version check to convert from a single element to an Array.
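A protocol-version check of this kind might be sketched as follows. Everything named here is an assumption for illustration: `ARRAY_CONTENT_SINCE` is a placeholder rather than a real revision, and `usesArrayContent` and `decodeMessage` are hypothetical helpers, not SDK functions. The only fact relied on is that MCP protocol versions are date strings such as `2024-11-05`, which compare correctly as plain strings.

```typescript
type Role = "user" | "assistant";
interface TextContent { type: "text"; text: string }
interface Message { role: Role; content: TextContent[] }
interface WireMessage { role: Role; content: TextContent | TextContent[] }

// HYPOTHETICAL placeholder: the first revision that would carry array
// content. No real revision is implied by this value.
const ARRAY_CONTENT_SINCE = "9999-12-31";

// Protocol versions are YYYY-MM-DD strings, so string comparison orders them.
function usesArrayContent(protocolVersion: string): boolean {
  return protocolVersion >= ARRAY_CONTENT_SINCE;
}

// Decode a wire message into the array shape, based on the negotiated version.
function decodeMessage(raw: WireMessage, protocolVersion: string): Message {
  if (usesArrayContent(protocolVersion)) {
    if (!Array.isArray(raw.content)) throw new Error("expected a content array");
    return { role: raw.role, content: raw.content };
  }
  if (Array.isArray(raw.content)) throw new Error("expected a single content block");
  // Older revision: convert the single element to an array.
  return { role: raw.role, content: [raw.content] };
}

const fromOldPeer = decodeMessage(
  { role: "user", content: { type: "text", text: "hi" } },
  "2024-11-05",
);
console.log(fromOldPeer.content.length);
```

Keeping the check in one decoding function lets the rest of an SDK work with the array shape only, whichever revision was negotiated.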
Types of changes
Checklist
Additional context
The User Guide will need updating on publication.