
@siwachabhi
Contributor

PR Description: Draft for Elicitation Feature

This PR implements the elicitation feature for MCP, enabling servers to request additional information from users through the client. This feature was identified as a key improvement for MCP-based agents in #111 and #314.

Motivation and Context

Many interactive workflows require servers to dynamically request additional information from users during execution. Examples include:

  1. Confirmation of critical actions
  2. Redirecting user to a sign-in flow
  3. Clarification of ambiguous requests
  4. Progressive form filling
  5. Contextual information gathering

Until now, MCP lacked a standardized way for servers to request this information, requiring developers to implement custom solutions or multi-step tool calls. This feature provides a clean, consistent protocol for these interactions, completing the bidirectional communication path between servers and users.

How Has This Been Tested?

The implementation has been validated through documentation review and schema consistency checks. The design follows the established patterns of MCP, particularly mirroring the array-based response approach used in tool calls. The implementation deliberately keeps the feature simple while providing extensibility for future enhancements.

Breaking Changes

None. This is a new feature that adds capabilities without changing existing functionality.

Types of changes

[x] New feature (non-breaking change which adds functionality)
[ ] Bug fix (non-breaking change which fixes an issue)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[x] Documentation update

Checklist

[x] I have read the MCP Documentation
[x] My code follows the repository's style guidelines
[x] New and existing tests pass locally
[x] I have added appropriate error handling
[x] I have added or updated documentation as needed

Additional context

Implementation Details

The implementation follows a minimalist approach with a simple request/response pattern:

  1. ElicitRequest - Server sends a message to be presented to the user
  2. ElicitResult - Client returns an array of content items representing the user's response

The array-based response design provides flexibility for clients to return multiple content items of different types (text, image, audio), similar to tool call results. This allows for rich responses like text with accompanying images.

Protocol

To request information from a user, servers send an elicitation/create request:

Request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "elicitation/create",
  "params": {
    "message": "Please provide your GitHub username"
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "octocat"
      }
    ]
  }
}

Request whose response contains multiple content items:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "elicitation/create",
  "params": {
    "message": "What is your favorite color?"
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 2,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "Blue"
      },
      {
        "type": "image",
        "data": "base64-encoded-image-data",
        "mimeType": "image/jpeg"
      }
    ]
  }
}

Message Flow

sequenceDiagram
    participant Server
    participant Client
    participant User
    
    Note over Server,Client: Server initiates elicitation
    Server->>Client: elicitation/create
    
    Note over Client,User: Human interaction
    Client->>User: Present elicitation UI
    User-->>Client: Provide requested information
    
    Note over Server,Client: Complete request
    Client-->>Server: Return user response
    
    Note over Server: Continue processing with new information

Future Considerations

While the current implementation is intentionally simple, future improvements could include:

  1. More sophisticated schema-based form generation
  2. Support for more complex interaction patterns
  3. Enhanced validation capabilities

These can be considered in future revisions as real-world usage patterns emerge.

@ibuildthecloud

I'd very much like to use this feature to request sensitive information from the user, so a "sensitive" boolean on the request would be necessary. Would there be any issue with that?

Contributor

@ihrpr ihrpr left a comment


Thank you for working on this!
The current ElicitRequest, which uses just a text message, is quite limited. We need to support structured input schemas so tools can request multiple pieces of information in a form-like manner.

We need to support common use cases like:

  • Requesting multiple fields
  • Type-specific inputs (numbers, dates, booleans)
  • Validation requirements
  • Optional vs required fields

Suggested changes

Add a requestedSchema field:

export interface ElicitRequest extends Request {
  method: "elicitation/create";
  params: {
    message: string;
    requestedSchema?: JSONSchema;  // Describes expected response structure
  };
}

export interface ElicitResult extends Result {
  content: unknown;  // Validated against requestedSchema
}

Benefits

  1. Better UX - Clients can generate proper form UIs
  2. Type safety - Responses validated against schema
  3. Backward compatible - Works without schema for simple text

Example

{
  method: "elicitation/create",
  params: {
    message: "Some additional details needed to configure a reminder",
    requestedSchema: {
      type: "object",
      properties: {
        title: { type: "string" },
        date: { type: "string", format: "date" },
        time: { type: "string", format: "time" },
        priority: { 
          type: "string", 
          enum: ["high", "medium", "low"] 
        }
      },
      required: ["title", "date"]
    }
  }
}
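Under the ElicitResult shape proposed above, a conforming user response might look like this (the field values are hypothetical):

{
  "content": {
    "title": "Dentist appointment",
    "date": "2025-05-15",
    "time": "14:00",
    "priority": "high"
  }
}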

If the user cancels, the client should return a standard JSON-RPC error response instead of a successful result with a cancelled flag.
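For example, reusing a sampling-style rejection error, a cancellation could be returned as follows (the exact error code and message here are illustrative, not normative):

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -1,
    "message": "User rejected elicitation request"
  }
}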

Could you please make changes on top of the draft version of the spec instead of 2025-03-26?

@siwachabhi
Contributor Author

siwachabhi commented Apr 30, 2025

Thanks @ihrpr,

  1. I had the requestSchema implemented locally, then punted on it, hoping RFC: add tool outputSchema and DataContent type to support structured content #356 would close in the meantime so we could have a consistent meaning of when the server/client is requesting/prescribing structured output.
  2. On the cancelled flag, yeah, we can re-use the error used for sampling: https://modelcontextprotocol.io/specification/2025-03-26/client/sampling#error-handling
  3. Use draft version as base -> yes, thanks for pointing out.

@siwachabhi
Contributor Author

@ibuildthecloud, could you please describe the sensitive boolean use-case in more detail: how does behavior change between sensitive and non-sensitive? Most likely the client can decide how it wants to render the elicitation request to users.

@siwachabhi siwachabhi requested a review from ihrpr April 30, 2025 04:21
@ihrpr
Contributor

ihrpr commented Apr 30, 2025

Thanks @ihrpr,

  1. I had the requestSchema implemented locally, then punted on it, hoping RFC: add tool outputSchema and DataContent type to support structured content #356 would close in the meantime so we could have a consistent meaning of when the server/client is requesting/prescribing structured output.

@siwachabhi, thank you

Sure, let's work on outputSchema discussion and include it in this spec changes

@ihrpr
Contributor

ihrpr commented Apr 30, 2025

Thanks @ihrpr,

  1. I had the requestSchema implemented locally, then punted on it, hoping RFC: add tool outputSchema and DataContent type to support structured content #356 would close in the meantime so we could have a consistent meaning of when the server/client is requesting/prescribing structured output.

@siwachabhi, thank you

Sure, let's work on outputSchema discussion and include it in this spec changes

After reviewing #356, I believe it's very tool-specific discussion. Since we're not constrained by backward compatibility issues in elicitation, we have the flexibility to adopt the approach that we find most appropriate, clean, and easy to follow. Therefore, I recommend adding the requestSchema to this PR.

Member

@cliffhall cliffhall left a comment


Apart from @ihrpr's remarks about requestSchema, this feature looks great. It fills an actual void in the protocol without heaping on that much complexity. LGTM! 👍

@jonathanhefner
Member

I've been thinking about agentic workflows and asynchronous workflows in general, and I think it might be beneficial to model elicitation as a tool result rather than something a tool does.

In an agentic workflow, there might be an agent several layers deep in the flow that needs to elicit input. Perhaps that elicitation needs to bubble up to the user, or perhaps it could be fulfilled by an intermediary agent.

Even in a non-agentic workflow, the user may want to suspend their session at the point of elicitation. For example, a tool elicits confirmation ("Are you sure?"), but the user doesn't want to answer yet, so they shut down the client / server until they've made a decision. When they restart the client / server, it should be possible to resume the session and respond to the elicitation.

Notably, the web has a way to handle the latter use case: forms. So, in the same way that getting a form is separate from submitting a form, what if elicitation was separate from fulfillment?

Specifically, what if tools could return a prompt and a pointer to another tool. The prompt would describe the desired input, and the inputSchema of the other tool would specify the shape of the input data. When the input data is ready, it would be passed to the other tool in order to "resume" the action.

The client would be responsible for rendering an interface that maps to the other tool's inputSchema. We could possibly define a new tool annotation to help with that (e.g. an annotation that points to one or more prerendered UIs that the client could use depending on its capabilities).
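A minimal TypeScript sketch of what that could look like (all names here are hypothetical; nothing in this PR defines them):

```typescript
// Hypothetical shape for the tool-result-based alternative: the tool returns
// an elicitation as data, pointing at a second tool that resumes the work.
interface ElicitationContent {
  type: "elicitation";
  prompt: string;     // describes the desired input to the user
  resumeTool: string; // tool whose inputSchema defines the shape of the reply
  state?: string;     // opaque token so the resume call can pick up prior context
}

// Example: a deployment tool pauses for confirmation instead of waiting in-band.
const pausedResult: ElicitationContent = {
  type: "elicitation",
  prompt: "Are you sure you want to deploy to production?",
  resumeTool: "confirm_deploy",
  state: "deploy-7f3a",
};
```

When the input is ready, the client would call the named resume tool with the collected data (plus the state token), and the server would continue the original action from there.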

@cliffhall
Member

In an agentic workflow, there might be an agent several layers deep in the flow that needs to elicit input. Perhaps that elicitation needs to bubble up to the user, or perhaps it could be fulfilled by an intermediary agent.

Regardless of how many "layers deep" you are in an agentic workflow, there's one human. The host app that sets the workflow in motion should be able to present any elicitation request directly to them.

Even in a non-agentic workflow, the user may want to suspend their session at the point of elicitation. For example, a tool elicits confirmation ("Are you sure?"), but the user doesn't want to answer yet, so they shut down the client / server until they've made a decision. When they restart the client / server, it should be possible to resume the session and respond to the elicitation.

Resumability of the streamable-http transport should support this out of the box.

Notably, the web has a way to handle the latter use case: forms. So, in the same way that getting a form is separate from submitting a form, what if elicitation was separate from fulfillment?

Specifically, what if tools could return a prompt and a pointer to another tool. The prompt would describe the desired input, and the inputSchema of the other tool would specify the shape of the input data. When the input data is ready, it would be passed to the other tool in order to "resume" the action.

The client would be responsible for rendering an interface that maps to the other tool's inputSchema. We could possibly define a new tool annotation to help with that (e.g. an annotation that points to one or more prerendered UIs that the client could use depending on its capabilities).

I'm not sure making the operation of tools more complicated is the answer. We have a feature that lets the server send a direct request to 'sample' an LLM. This is the same, just sampling the human. It follows an already established pattern for requesting input from the world and could operate mostly the same way.

@siwachabhi
Contributor Author

siwachabhi commented Apr 30, 2025

+1 @cliffhall

I had a similar thought to yours @jonathanhefner here: #314, but after discussing with other folks in the community and trying a few things out, I settled on introducing a new server request, which aligns better with the MCP mental model.

The response above from @cliffhall is essentially what I was also about to add: there is value in keeping tools simpler, and this is still a two-way door/forward compatible in the future if some workflow semantics have to be added to tools.

@siwachabhi
Contributor Author

Thanks @ihrpr, that confirms it. And to get your opinion on content in the result: it could be an open-ended field. As you have described, I agree it's simpler; consistency was my only concern, so I'm thinking through that a little more, but these don't have to be the same thing.

export interface ElicitResult extends Result {
  content: unknown;  // Validated against requestedSchema
}

@ihrpr
Contributor

ihrpr commented May 1, 2025

Thanks @ihrpr, that confirms it. And to get your opinion on content in the result: it could be an open-ended field. As you have described, I agree it's simpler; consistency was my only concern, so I'm thinking through that a little more, but these don't have to be the same thing.

export interface ElicitResult extends Result {
  content: unknown;  // Validated against requestedSchema
}

yeah, same as parameters for tools: they are [key: string]: unknown; and are validated at runtime
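As a sketch of what that runtime validation could look like (a hand-rolled check for illustration; validateContent is hypothetical, and a real implementation would use a full JSON Schema validator):

```typescript
// Minimal runtime check of an elicitation response against a requestedSchema,
// analogous to how tool parameters are validated at runtime.
type ObjectSchema = {
  type: "object";
  properties: Record<string, { type: string }>;
  required?: string[];
};

function validateContent(
  schema: ObjectSchema,
  content: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  // Every required property must be present.
  for (const key of schema.required ?? []) {
    if (!(key in content)) errors.push(`missing required field: ${key}`);
  }
  // Every supplied property must be declared and have the declared type.
  for (const [key, value] of Object.entries(content)) {
    const prop = schema.properties[key];
    if (!prop) errors.push(`unexpected field: ${key}`);
    else if (typeof value !== prop.type)
      errors.push(`field ${key}: expected ${prop.type}, got ${typeof value}`);
  }
  return errors;
}

const reminderSchema: ObjectSchema = {
  type: "object",
  properties: { title: { type: "string" }, date: { type: "string" } },
  required: ["title", "date"],
};
```

A response of `{ title: "Dentist" }` would fail with a missing `date`, while `{ title: "Dentist", date: "2025-05-15" }` would pass.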

@cmsparks

cmsparks commented May 1, 2025

@ibuildthecloud, could you please describe the sensitive boolean use-case in more detail: how does behavior change between sensitive and non-sensitive? Most likely the client can decide how it wants to render the elicitation request to users.

Consider: an MCP client calls a tool which requires some secret from the user. This could be credentials, payment information, etc. We don't want the secret to be a tool call parameter, because that is saved in the chat context as a tool call. If my reading of this spec is right, elicitations in their current state don't place any restrictions on the client as to whether it can save the information, include the elicitation in LLM context, etc.?

If we include an "is sensitive" boolean, this indicates to the client+server that it MUST NOT save/log the information (or, less strictly, that it must save the information securely, in a password manager for example).

In Cloudflare's MCP servers we've run into this issue in a few places. For example, this tool, which we're removing temporarily, requires database credentials: https://github.com/cloudflare/mcp-server-cloudflare/blob/599bfcf51e64faad9f43f6ad28fa05e8cbd93684/packages/mcp-common/src/tools/hyperdrive.ts#L83

Elicitations with a sensitive flag would solve our issues here.
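For concreteness, the proposal amounts to something like the following (a hypothetical field, not part of this PR):

```typescript
// Hypothetical "sensitive" hint on elicitation params. It cannot enforce
// anything by itself, but it signals that the client MUST NOT log the
// response, persist it in chat history, or include it in LLM context.
interface ElicitParamsWithSensitivity {
  message: string;
  sensitive?: boolean;
}

const params: ElicitParamsWithSensitivity = {
  message: "Enter the database connection string for this Hyperdrive config",
  sensitive: true,
};
```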

@cliffhall
Member

If we include an is sensitive boolean, this indicates to the client+server that it MUST NOT save/log the information (or maybe less strictly, save the information securely)

Yes, this. It should also indicate that it MUST NOT include the information in LLM context.

@wdawson
Contributor

wdawson commented May 2, 2025

Hey @siwachabhi this PR looks awesome. I have been thinking about some similar flows, and there are definitely cases where I want the data I collect from the user not to pass through the MCP Client. Either because it's sensitive data, or I want to control the UX of the data collection rather than leave it up to the client.

Have you thought about those types of flows as well?

Also, this seems to assume support for Server-Sent Events (SSE), but I think that's optional in the spec generally. I've been thinking of ways to support this type of thing without needing SSE as well.

@nbarbettini and I are working on a PR that will cover that and I'm happy to incorporate this elicitation pattern into that!

@cliffhall
Member

Also, this seems to assume support for Server-Sent Events (SSE), but I think that's optional in the spec generally. I've been thinking of ways to support this type of thing without needing SSE as well.

Like sampling requests, if the tool needs to elicit input from the user it MUST use SSE. It has to be able to send the client a request rather than the other way around. For instance, this would not be possible on the STDIO transport. Some other features like resource subscriptions also require it.

@siwachabhi siwachabhi requested a review from cliffhall May 2, 2025 22:59
@siwachabhi
Contributor Author

siwachabhi commented May 2, 2025

I am concerned that a sensitive boolean will not provide any security guarantee, so at best it can be a hint. Additionally, for the use-case described here of collecting credentials, I find it premature to bake that into the protocol payload; ideally it should work out of band (for example, Auth0's token vault: https://auth0.com/blog/mcp-and-auth0-an-agentic-match-made-in-heaven/). I see this discussion as similar: #234, and there are clearly two sides. My target with this PR is to enable the base use-case of eliciting information from the client where payloads don't contain any sensitive data.


Also, this seems to assume support for Server-Sent Events (SSE), but I think that's optional in the spec generally. I've been thinking of ways to support this type of thing without needing SSE as well.

I think that would essentially become supporting workflow tools, which would require tool functionality to be extended; there is a similar discussion above.

@ihrpr
Contributor

ihrpr commented May 3, 2025

Also, this seems to assume support for Server-Sent Events (SSE), but I think that's optional in the spec generally. I've been thinking of ways to support this type of thing without needing SSE as well.

The protocol is transport-agnostic, and so are all of its features. Any transport layer should support bidirectional message exchange.

SSE is part of the transport implementation; we use it in the SSE and Streamable HTTP transports, but all the features, like sampling, are supported on stdio. Elicitation is a feature of the protocol and is transport agnostic.

@jonathanhefner
Member

In an agentic workflow, there might be an agent several layers deep in the flow that needs to elicit input. Perhaps that elicitation needs to bubble up to the user, or perhaps it could be fulfilled by an intermediary agent.

Regardless of how many "layers deep" you are in an agentic workflow, there's one human. The host app that sets the workflow in motion should be able to present any elicitation request directly to them.

If an intermediary agent knows enough about a user, I think it would be reasonable to allow it to fulfill (or reject) elicitation requests on the user's behalf.

But, regardless, my point was about the elicitation bubbling up through layers of agents. How would that work with the sampling-derived proposal in this PR?

Also, how would the sampling-derived proposal handle elicitation in long-running asynchronous workflows when the user / client is offline?

Essentially, I am proposing we model elicitation as message passing in order to address these issues. Messages can be forwarded and routed and queued as necessary. A tool could return an elicitation (a message) which includes instructions about how to respond, and the subsequent response would cause the work to continue. A key point is that the tool returns the elicitation, and then stops running (it does not wait for a response).

Even in a non-agentic workflow, the user may want to suspend their session at the point of elicitation. For example, a tool elicits confirmation ("Are you sure?"), but the user doesn't want to answer yet, so they shut down the client / server until they've made a decision. When they restart the client / server, it should be possible to resume the session and respond to the elicitation.

Resumability of the streamable-http transport should support this out of the box.

Could you elaborate? I'm having trouble envisioning how this would work. If the server has been shut down and restarted, how would the tool continue from the point of elicitation?

I'm not sure making the operation of tools more complicated is the answer.

Which part do you feel makes the operation of tools more complicated?

We have a feature that lets the server send a direct request to 'sample' an LLM. This is the same, just sampling the human. It follows an already established pattern for requesting input from the world and could operate mostly the same way.

Actually, I think that sampling could also benefit from being modeled as message passing. If I understand correctly, currently, sampling during a tool call requires sticky sessions, which makes it very unfriendly to certain kinds of deployments.

But that's a separate discussion! 😄 I would just like to make sure elicitation doesn't suffer from that limitation.


A few things I didn't mention in my original comment:

  • There should be some way to bundle state with the elicitation, which would then be passed to the tool that handles the response. I can think of a few different ways to model that. I would be very happy to discuss them and other possibilities, but this reply is already long enough.
  • With the message passing model, there is no need for an explicit cancellation mechanism. The client can just not respond to the elicitation.
  • There is an opportunity for a tool to return multiple elicitations as separate items in content. That could enable some interesting workflows, e.g. "approve X task and Y task, but postpone Z task".
  • I feel like a better name for all of this is "solicitation" rather than "elicitation". 😅

@siwachabhi
Contributor Author

siwachabhi commented May 3, 2025

Great questions, my mental model here is:

  1. The base of the protocol is JSON-RPC messages: https://modelcontextprotocol.io/specification/2024-11-05#base-protocol.
  2. Requests, responses, and notifications all build on top of messages: https://modelcontextprotocol.io/specification/2024-11-05/basic.
  3. The server may decide to halt execution when requesting elicitation. This could be immediate or at an optimal point in execution.
  4. If the client disconnects, it should use Streamable HTTP semantics: https://modelcontextprotocol.io/specification/draft/basic/transports#streamable-http, to identify which requests are pending a response. Once it responds, server execution should continue from the same point.

sequenceDiagram
participant User
participant Client
participant Server
participant Tool

    
Note over Client,Server: Session Establishment
Client->>+Server: POST InitializeRequest
Server->>Server: Generate Session ID
Server-->>Client: InitializeResult<br>Mcp-Session-Id: abc123
Note over Client,Server: Session ID stored by client


Note over Client,Server: Tool Execution Phase
User->>Client: Start workflow
Client->>+Server: ToolCallRequest<br>Mcp-Session-Id: abc123<br> RequestId: req123
Server->>+Tool: Execute

Tool->>Server: Needs user input

Server-->>Client: Create SSE stream with event ID<br>Content-Type: text/event-stream
Server->>-Client: SSE event: id=event789<br>ElicitRequest<br>Mcp-Session-Id: abc123<br> RequestId: req456

Server->>StateStore: Store execution state<br>Mcp-Session-Id: abc123<br> RequestIds: req123, req456
Note over Tool,Server: Tool Execution can be halted

Note over User,Client: Disconnection Scenario
Note over Client: Client stores last processed event-id
Client-xServer: Connection dropped

Note over User,Client: Later Reconnection
User->>Client: Reconnect & provide input
Client->>+Server: GET with Session ID and Last-Event-ID<br>Mcp-Session-Id: abc123<br>Last-Event-ID: event789

Server->>Server: Resume session state
Server-->>Client: 200 OK<br>Content-Type: text/event-stream
Note right of Server: Stream resumed

Note over Client,Server: Completing Elicitation
Client->>+Server: POST ElicitResult with Session ID<br>Mcp-Session-Id: abc123<br> RequestId: req456

Server->>StateStore: Retrieve execution state
StateStore-->>Server: Return state data

Server->>+Tool: Resume execution
Tool->>-Server: Complete task
Server-->>-Client: Return ToolCallResult<br>Mcp-Session-Id: abc123<br> RequestId: req123

Client->>User: Show result to user

A gap that still remains after this is if a client wants to GET the state of a tool call; progress notifications partially fill that gap, but there could also be a scenario where the client needs to GET progress.

@LucaButBoring
Contributor

a tool needs more information; rather than throwing an invalid params error, it elicits the data it needs.

Regarding this case in particular, elicitation can be used within a tool call just like sampling can, but that's probably not documented clearly - also just like sampling 😔

@patwhite
Contributor

patwhite commented Jun 9, 2025

a tool needs more information; rather than throwing an invalid params error, it elicits the data it needs.

Regarding this case in particular, elicitation can be used within a tool call just like sampling can, but that's probably not documented clearly - also just like sampling 😔

Ya, I read through the full thread here and I see the long discussion with @jonathanhefner - I 100% think this was pulled too soon. None of Jonathan's issues were properly addressed, and this is a nightmare to implement for both clients and servers. For instance, there is NOWHERE that specifies clients are expected to process a separate event stream while they're waiting on a tool response - every client I know today waits for a tool call to complete. This is a major rework on the clients to support this. Then there's the statefulness: I don't believe Jonathan's stateless model is supported by the spec today, so that's not a usable model.

And then, as to the sensitive data issue, the way OAuth works is a perfect example of correctly gathering sensitive data: you send the user to an external URL to actually gather the data, so there's no chance it's cached along the way. So that + OAuth seem like a critical use case that is ignored, as far as I can tell.

@nbarbettini
Contributor

And then, as to the sensitive data issue, the way OAuth works is a perfect example of correctly gathering sensitive data: you send the user to an external URL to actually gather the data, so there's no chance it's cached along the way. So that + OAuth seem like a critical use case that is ignored, as far as I can tell.

@patwhite I'm addressing this (with @wdawson) in #475. You are correct that elicitation is defined for a different use case than OAuth/server-initiated auth escalation. That's why the Security Considerations section says Servers MUST NOT request sensitive information through elicitation. In #475 we are defining what the "out of band" path looks like.

@patwhite
Contributor

patwhite commented Jun 9, 2025

And then, as to the sensitive data issue, the way OAuth works is a perfect example of correctly gathering sensitive data: you send the user to an external URL to actually gather the data, so there's no chance it's cached along the way. So that + OAuth seem like a critical use case that is ignored, as far as I can tell.

@patwhite I'm addressing this (with @wdawson) in #475. You are correct that elicitation is defined for a different use case than OAuth/server-initiated auth escalation. That's why the Security Considerations section says Servers MUST NOT request sensitive information through elicitation. In #475 we are defining what the "out of band" path looks like.

Ya, I saw your comment, and I'm essentially saying these should not be separate: this PR should better contemplate tool-call elicitation and auth escalation / sensitive data. As it's written, it is really hard for me to imagine this getting any more support than sampling, since those are really the two primary use cases for this feature.

@patwhite
Contributor

patwhite commented Jun 9, 2025

I'll add - there are 3 primary use cases I see for elicitation; please correct me if I'm wrong:

  1. Server first touch - that's the GitHub server gathering your GitHub username when you connect; this PR has that covered.
  2. Multi-turn tool interactions
  3. Auth escalations (and sensitive data)

This PR handles one of the three, so we're saying we have to build two other async messaging protocols here?

@nbarbettini
Contributor

2 Multi-turn tool interactions

I'm curious about this - can you elaborate @patwhite?

@patwhite
Contributor

patwhite commented Jun 9, 2025

2 Multi-turn tool interactions

I'm curious about this - can you elaborate @patwhite?

Sure, this is the use case @jonathanhefner brought up - a tool call needs additional information, so it elicits that from the user, pausing the initial tool response. I'd add that I could also see elicitation being used for long-running tool calls: you ask the MCP server to render a video, it returns an OK, then at some point in the future it sends you an elicitation with a URL, asking if you have any changes.

@patwhite
Contributor

patwhite commented Jun 9, 2025

Another great use case: think about an MCP server to do your taxes - it would heavily use this feature, but I couldn't figure out how it should be built from the spec here.

@patwhite
Contributor

patwhite commented Jun 9, 2025

One other quick addition after reading this - MCP is already a VERY hard protocol to scale, this model potentially requires two SSE connections, and I believe it requires a new session management model that spans SSE sessions (something explicitly not included in the spec), plus full statefulness of the backend. This will just make a hard-to-scale protocol even harder, and ultimately this is going to hinder adoption of this as much as sampling is hindered.

And finally - I just don't see a world where you can enforce the sensitive data constraint. The only possible way is through some sort of semantic filtering, and I wouldn't be surprised if it turns out every stdio server that relies on API keys uses this to get them; it's just such a nicer user experience. So, in the absence of a real solution for sensitive data gathering, this will be what folks use. Also, "sensitive" is not a universally agreed-upon term; your GitHub username can be considered sensitive if other user data is co-mingled (de-anonymizing your name, for instance). So there's a level of subjectivity in that statement, which is never good in a MUST spec clause.

@siwachabhi
Contributor Author

Hi @patwhite, @nbarbettini and team are already driving the out-of-band auth discussion; the prescriptive guidance identified by the auth working group is pretty useful. Additionally, they have a viable point of exploration around URL specification/user agent, but the exact outcome might just be elicitation being extended, so I would wait for consensus from the steering committee. Best to discuss in #475.

MCP is already a VERY hard protocol to scale, this model potentially requires two SSE connections

I didn't follow how we got to this conclusion, it will be a single SSE stream from server to client.

The gap is the spec behavior for long-running tools (also being explored in multiple discussions), especially if the caller or tool doesn't know beforehand that it will be long-running. The spec recommends the server should not close the SSE stream before completing a tool call response.

a tool call needs additional information, so it elicits that from the user, pausing the initial tool response

How to model if tool can continue to work while it has requested elicitation?

As an alternative to Streamable HTTP, one could very well implement the existing transport spec as an HTTP/1.1 transport where the HTTP response/request body contains a JSON-RPC response/request/notification; the protocol itself doesn't need to change for that, and the same goes for a WebSockets transport. So we essentially need to iterate a bit on transports, which was out of scope for this PR.

Just to confirm we are on the same understanding: if a tool is requesting some information from the client, the tool will have to be stateful to resume from the response. An exception is if the server returns its complex state as part of the response, but at that point the output/input schema of the tool is specialized to represent internal state that can be externalized, and isn't that just a tool? (Example: a browser workflow with Playwright MCP.)
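As a rough illustration of the externalized-state idea (all names here are hypothetical, not part of any MCP SDK), a tool can return an opaque resume token in its result so the next call can continue without server-side session state:

```typescript
// Hypothetical sketch (not MCP SDK code): a tool that externalizes its
// internal state in the result, so a later call can resume it without
// any server-side session state.
interface ResumableToolResult {
  content: string;        // partial output produced so far
  done: boolean;          // whether the workflow finished
  resumeState?: string;   // opaque, serialized internal state
}

function runWorkflowStep(input: string, resumeState?: string): ResumableToolResult {
  const step: number = resumeState ? JSON.parse(resumeState).step : 0;
  if (step >= 2) {
    return { content: `finished processing "${input}"`, done: true };
  }
  return {
    content: `completed step ${step}`,
    done: false,
    resumeState: JSON.stringify({ step: step + 1 }),
  };
}

// The caller feeds resumeState back in; the server itself stays stateless.
let result = runWorkflowStep("task");
while (!result.done) {
  result = runWorkflowStep("task", result.resumeState);
}
```

Whether that is still "just a tool" is exactly the question raised above; the sketch only shows that the state round-trip is mechanically possible.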

@patwhite
Contributor

MCP is already a VERY hard protocol to scale, this model potentially requires two SSE connections

I didn't follow how we got to this conclusion, it will be a single SSE stream from server to client.

This was from the discussion you had with @jonathanhefner that just got dropped, but honestly it is the #1 use case here - unless I misread something, the elicitation after a tool call will trigger a second SSE session getting created. That might have just been for the more stateless model, but a >12-step process to handle post-tool-call elicitation is way, way too complex. Jonathan's proposal to include a tool response type of an elicitation request would deal with this very nicely, hence why I keep saying this was included too early and without fully thinking through backend implementation details.

The gap is the spec behavior for long-running tools (also being explored in multiple discussions), especially if the caller or tool doesn't know beforehand that it will be long-running. The spec recommends that the server should not close the SSE stream before completing a tool call response.

Yes, there's a huge gap in the spec here; this solves PART of it, but solving one small part in a vacuum doesn't make sense - this should be thought of holistically. What we're basically saying is that there's a multi-turn tool-calling lifecycle that might involve pushing notifications, might involve elicitation, etc. We should solve that problem rather than taking a one-off piece that makes elicitation very complex.

a tool call needs additional information, so it elicits that from the user, pausing the initial tool response

How do we model a tool that can continue to work while it has requested elicitation?

Elicitation as a tool response deals with this

As an alternative to streamable HTTP, one could very well implement the existing transport spec as an HTTP/1.1 transport where the HTTP response/request body contains a JSON-RPC response/request/notification; the whole protocol doesn't need to change for that, and the same goes for a WebSockets transport. So we essentially need to iterate a bit on transport, which was out of scope for this PR.

Again, I'll go back to the concerns that @jonathanhefner brought up and that were not addressed - there are protocol-level issues with the most basic use case here, a tool call eliciting more information. If we want this feature adopted, you can't have the feature be ambiguously supported by the underlying protocol itself.

Just to confirm we are on the same understanding: if a tool is requesting some information from the client, the tool will have to be stateful to resume from the response. An exception is if the server returns its complex state as part of the response, but at that point the output/input schema of the tool is specialized to represent internal state that can be externalized, and isn't that just a tool? (Example: a browser workflow with Playwright MCP.)

With elicitation in a tool response, it can be session-aware but not necessarily stateful. Just so we're on the same page: as SOON as you escalate to an SSE connection, you have created a stateful requirement that there be a stateful singleton running on the server and, if you've scaled out to multiple nodes, a message broker. There should be a model by which you do not need to upgrade to an SSE connection to make elicitation work. That's different from session-aware, and something that really gets glossed over in all these MCP discussions - session-aware != stateful. SSE sessions require a long-lived, open connection that can be discovered by other hosts in the system when a message POST comes in (in order to deliver the response). That is different from session-aware, where you record in Redis that the last request from this session was this tool call, so when you get an elicitation response you can continue it. This is a VERY important distinction.

@LucaButBoring
Contributor

the elicitation after a tool call will trigger a second SSE session getting created

I don't believe this is necessarily true? Elicitation implies having some sort of server->client connection to send the creation request on, but it shouldn't need to be a different one from the one used for tool calls (but it could be). As a client, you can receive a server request on the same stream that's waiting for a tool response.
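To sketch that point (message shapes abbreviated from JSON-RPC; this is not SDK code), a client reading a single stream can tell a server-initiated request from a response by shape: a message with a `method` is an incoming request such as `elicitation/create`, while a message with a `result` answers an earlier client request:

```typescript
// Sketch (abbreviated JSON-RPC shapes, not SDK code): one stream can
// carry both server-initiated requests and responses to the client's
// own requests; the client routes each message by its shape.
type JsonRpcMessage = {
  id?: number;
  method?: string;   // present on requests/notifications from the server
  params?: unknown;
  result?: unknown;  // present on responses to earlier client requests
};

function routeMessage(msg: JsonRpcMessage): string {
  if (msg.method !== undefined) {
    // A server-initiated request arriving on the same stream the client
    // is already holding open for the pending tool-call response.
    return `server request: ${msg.method}`;
  }
  if (msg.result !== undefined) {
    return `response to request ${msg.id}`;
  }
  return "unknown message";
}

// Interleaved traffic on a single stream:
const stream: JsonRpcMessage[] = [
  { id: 7, method: "elicitation/create", params: { message: "Need your city" } },
  { id: 3, result: { content: "tool output" } }, // answers the original tools/call
];
const routed = stream.map(routeMessage);
```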

There should be a model by which you do not need to upgrade to an SSE connection to make elicitation work.

This is arguably a flaw of Streamable HTTP, not of elicitation. It's something that impacts all server->client requests and notifications. It could be fixed with polling to set up a makeshift server->client stream, but I'd honestly consider that something the transport layer itself should handle, not the protocol interaction on top of it.

@siwachabhi
Contributor Author

With elicitation in a tool response, it can be session-aware but not necessarily stateful. Just so we're on the same page: as SOON as you escalate to an SSE connection, you have created a stateful requirement that there be a stateful singleton running on the server and, if you've scaled out to multiple nodes, a message broker. There should be a model by which you do not need to upgrade to an SSE connection to make elicitation work. That's different from session-aware, and something that really gets glossed over in all these MCP discussions - session-aware != stateful.

Let's align on requirements:

  1. A server requesting elicitation will need stateful storage, at least to keep track of the session ID and the data associated with it.
  2. A server requesting elicitation shouldn't need a long-lived connection: that makes deployment harder, which was the case with the original SSE transport spec of MCP (it might be a good idea to review the latest spec; it's intended to solve this problem, but there are still a few paper cuts).
  3. The server should be able to send messages to the client which are not directly a response to the original request.

If we align on this, then the answer is that MCP needs a resumable bi-directional transport; if the current resumable HTTP transport version doesn't work, then it needs to be improved. Note: https://html.spec.whatwg.org/multipage/server-sent-events.html - SSE doesn't require a long-lived connection; that's why the Last-Event-Id header exists. Can it be further simplified? Yes. But returning elicitation in the tool response is not extensible to other server-to-client concepts in MCP, and I think there will be more in the future; that's the whole value-add of LLMs/agents over the long term. An example of the extensible approach I described above is that a server should be able to return a client request in response to a server request. But this is not an HTTP standard; SSE is.
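A minimal sketch of the Last-Event-Id mechanism referenced above (buffer and names are illustrative, not from the MCP transport spec): the server numbers the events it sends and, on reconnect, replays everything after the ID the client last saw, so the logical stream survives connection drops:

```typescript
// Sketch of SSE resumability via Last-Event-Id (names illustrative, not
// from the MCP transport spec): the server keeps a replay buffer of sent
// events; on reconnect the client presents the last ID it saw and the
// server replays everything after it.
interface SseEvent {
  id: number;
  data: string;
}

class ReplayBuffer {
  private events: SseEvent[] = [];
  private nextId = 1;

  send(data: string): SseEvent {
    const event: SseEvent = { id: this.nextId++, data };
    this.events.push(event);
    return event;
  }

  // What the server re-delivers for a reconnect carrying
  // a "Last-Event-Id: <id>" header.
  replayAfter(lastEventId: number): SseEvent[] {
    return this.events.filter((e) => e.id > lastEventId);
  }
}

const buffer = new ReplayBuffer();
buffer.send("elicitation request");
buffer.send("progress notification");
buffer.send("tool result");

// The client disconnected after seeing event 1 and reconnects:
const missed = buffer.replayAfter(1);
```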

@patwhite
Contributor

With elicitation in a tool response, it can be session-aware but not necessarily stateful. Just so we're on the same page: as SOON as you escalate to an SSE connection, you have created a stateful requirement that there be a stateful singleton running on the server and, if you've scaled out to multiple nodes, a message broker. There should be a model by which you do not need to upgrade to an SSE connection to make elicitation work. That's different from session-aware, and something that really gets glossed over in all these MCP discussions - session-aware != stateful.

Let's align on requirements:

  1. A server requesting elicitation will need stateful storage, at least to keep track of the session ID and the data associated with it.

No one is debating stateful storage - in the context of backend development, that's just storage; that's not "statefulness". Statefulness in this context refers to the long-lived code that has to be running on the node that the SSE session was initially connected to, keeping the SSE session alive. While it may be technically possible to treat SSE as a polling protocol like you're describing, I don't know of a single client or server library that operates like that. SSE is always implemented as a long-lived connection. That's why it was created; if you want to return single items, you just use HTTP.

  1. A server requesting elicitation shouldn't need a long-lived connection: that makes deployment harder, which was the case with the original SSE transport spec of MCP (it might be a good idea to review the latest spec; it's intended to solve this problem, but there are still a few paper cuts).

SSE is a long-lived connection; the second the server escalates a tool response to an SSE session in order to send an elicitation request, you have made this long-lived. The server could kill it immediately, which would trigger an immediate request from the client to re-establish, which you could then 204 - but you wouldn't do that. You're going to have a tool response to send back as soon as the elicitation response comes back in, so why would you close the session?

Just to put a finer point on it - here's the SSE spec guidance in MCP:

The server SHOULD NOT close the SSE stream before sending a JSON-RPC response per each received JSON-RPC request, unless the [session](https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#session-management) expires.

So now, imagine you're in a scaled-out scenario. Let's walk through how that works for elicitation on a tool request, with two nodes behind a load balancer.

  1. The client sends a tool request; this hits node 1.
  2. The server responds with an SSE upgrade and sends the elicitation request. It holds this open per the spec, waiting to send back a response to the tool request.
  3. The elicitation response hits node 2.
  4. Depending on the implementation, node 2 hydrates the session, answers the original tool request, then sends a message to node 1, which delivers the response to the client. The server can then close the session. There are other ways to do this; this is just illustrative, but all require messaging between nodes.

So it is against the protocol to do what you're describing, where you forcibly close the stream and then deliver the message on a different SSE channel.

You can check out scaled-mcp for how we handle this - but with all the motions we've made towards statelessness in the 2025 spec, why regress it here?

So, that brings us to the final point: what would be better?

In order to build this statelessly, it's a pretty small tweak. We allow tools to return an elicitation request as a tool response, and we allow an elicitation response request to have a tool response.

That's just one idea - I'm not even sure it's the best, and I could imagine a bunch of other ways we do this. But overall, I would much, much prefer that, instead of pushing this through in a way that really threatens elicitation's adoption because of the statefulness, we have the bigger discussion about multi-turn tool calls (and the auth escalation), because they're fundamentally all part of the same pattern: you make a tool call, you need the user to do something, the user does it, then you answer the response. That's the meta flow we need to solve, and solving it 3 different times (once for this, once for tool calls, and once for auth) just seems wasteful.

  1. The server should be able to send messages to the client which are not directly a response to the original request

My issue with this proposal takes no opinion on what the protocol should do in the future; it's about how it works today.
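For concreteness, the "elicitation as a tool response" tweak might look something like the following. This is a purely hypothetical sketch - none of these field names exist in the current spec - showing how a tool result could embed the elicitation request, with an ordinary follow-up request completing the call without any SSE upgrade:

```typescript
// Purely hypothetical sketch: none of these field names are in the
// current MCP spec. The tool result embeds the elicitation request,
// and a follow-up plain HTTP request resumes the call, so no SSE
// upgrade is needed.
interface ElicitationInToolResult {
  type: "elicitation";
  message: string;
  resumeToken: string; // lets a session-aware (not stateful) server continue
}

interface FinalToolResult {
  type: "result";
  content: string;
}

type ToolResult = ElicitationInToolResult | FinalToolResult;

function callTool(city?: string): ToolResult {
  if (city === undefined) {
    return {
      type: "elicitation",
      message: "Which city should the forecast cover?",
      resumeToken: "call-123", // would be looked up in shared storage, e.g. Redis
    };
  }
  return { type: "result", content: `Forecast for ${city}` };
}

// The first call elicits; an ordinary second HTTP request completes it.
const first = callTool();
const second = first.type === "elicitation" ? callTool("Berlin") : first;
```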

@siwachabhi
Contributor Author

siwachabhi commented Jun 11, 2025

it is against the protocol to do what you're describing where you forcibly close the stream

Let's raise a PR for this? Both of us are talking about the same paper cuts in the transport, so why not improve it at the core? "You're going to have a tool response to send back as soon as the elicitation request comes back in, why would you close the session" - even then, it should be the server's/client's choice whether to wait for a bit before closing.

with all the motions we've made towards statelessness in the 2025 spec, why regress it here

There is still streamable HTTP, which supports a resumable bi-directional connection. How is this a regression? It's making a paper cut in the transport obvious.

it's a pretty small tweak. We allow tools to return an elicitation request as a tool response, and we allow an elicitation response request to have a tool response.

It's not; it's a one-way-door decision. A more two-way-door decision is to follow the current protocol and improve the transport spec; if that doesn't work, then we have to change the MCP philosophy. I am not tied to being right, but I don't see any other way to make progress on this.

Also, related to returning elicitation in a tool result: the answer there is not to chain one concept into another; we end up in dependency hell with that if we want concepts to be independently composable, which seems to be the philosophy of MCP, and the protocol maintainers confirmed it above (folks, you could have just asked to go the other route, and we would have a different outcome). Again, I am not taking to heart what the globally optimal right answer is (we won't know that in the short term in the LLM space), but features should align with the protocol philosophy; only then will we get to the true limits of the protocol, else it's a spaghetti of everything.

If human-in-the-loop is such an important feature and doesn't get adoption (again, it's not the same as sampling, for which I don't see a clear use case irrespective of how one implements it), then that will clearly show a pretty big gap in the protocol, which will warrant either a transport spec improvement or a more fundamental one.

That's the meta flow we need to solve, and solving it 3 different times (once for this, once for tool calls, and once for auth) just seems wasteful

It will be a single concept in my opinion, but it can't be a big-bang single change; we will get there incrementally.

this proposal takes no opinion on what the protocol should do in the future

That will be a transport spec update PR

@patwhite
Contributor

it is against the protocol to do what you're describing where you forcibly close the stream

Let's raise a PR for this? Both of us are talking about the same paper cuts in the transport, so why not improve it at the core? "You're going to have a tool response to send back as soon as the elicitation request comes back in, why would you close the session" - even then, it should be the server's/client's choice whether to wait for a bit before closing.

I'm open to doing some work on this, but it's a pretty fundamental change, and I imagine @dsp-ant had a reason not to support cross-SSE messaging when they built that out; but if that's the answer from the steering committee, I'm happy to help out. This PR is also missing other components - guidance to clients that they should be watching for elicitation requests while waiting for tool responses. I believe almost all clients today are more or less entirely paused waiting for tool responses and will need to explicitly support this sort of interrupt pattern.

with all the motions we've made towards statelessness in the 2025 spec, why regress it here

There is still streamable HTTP, which supports a resumable bi-directional connection. How is this a regression? It's making a paper cut in the transport obvious.

It's a regression of the move toward statelessness. If you look at the VAST majority of remote MCP servers, they are not implementing SSE; they are implementing just HTTP responses to tool calls. It's a regression of the progress toward making the protocol less stateful and supporting most modern deployment practices (in particular, serverless deployments). Right now, deploying an MCP server that supports SSE and scales out requires k8s or bare metal; Lambdas or Cloud Run present very tricky challenges.

it's a pretty small tweak. We allow tools to return an elicitation request as a tool response, and we allow an elicitation response request to have a tool response.

It's not; it's a one-way-door decision. A more two-way-door decision is to follow the current protocol and improve the transport spec; if that doesn't work, then we have to change the MCP philosophy. I am not tied to being right, but I don't see any other way to make progress on this.

"One-way door" isn't the right way to talk about this; every change to a spec is a one-way door that you'll have to support for at least several versions.

I'd offer that a better way to think about this is in terms of protocol bloat: a new response type is a MUCH smaller change than a full new bidirectional messaging exchange. But the WORST protocol bloat is yet to come - since this proposal doesn't adequately handle the auth use case, another bidirectional messaging protocol doing essentially the same thing will come next version. And since this doesn't work with stateless tool calls, that will end up being included at some point. So, in terms of bloat, including this without dealing with the other two use cases will lead to the greatest amount of bloat, and again, because this requires statefulness, from all the evidence we've seen it's going to have trouble getting adoption.

Also, related to returning elicitation in a tool result: the answer there is not to chain one concept into another; we end up in dependency hell with that if we want concepts to be independently composable, which seems to be the philosophy of MCP, and the protocol maintainers confirmed it above (folks, you could have just asked to go the other route, and we would have a different outcome). Again, I am not taking to heart what the globally optimal right answer is (we won't know that in the short term), but features should align with the protocol philosophy; only then will we get to the true limits of the protocol, else it's a spaghetti of everything.

This isn't a dependency-hell situation with variable multi-hop dependencies, etc. There are 3 very clear use cases for this type of communication that I can come up with, and maybe a handful of others; let's gather those use cases, then design something that works for the majority of them. I'm proposing we approach this like I would approach any sort of engineering effort.

If human-in-the-loop is such an important feature and doesn't get adoption (again, it's not the same as sampling, for which I don't see a clear use case irrespective of how one implements it), then that will clearly show a pretty big gap in the protocol, which will warrant either a transport spec improvement or a more fundamental one.

I mean, this is my fundamental issue here - you're asserting this will get adoption; I'm asserting it will have challenges. But if we're being objective, NONE of the server-to-client messaging models have gotten wide adoption. What's been widely adopted is client-to-server requests. So, given that's our only real data point, why are we introducing yet another speculative server-to-client model when we could design this better?

this proposal takes no opinion on what the protocol should do in the future

That will be a transport spec update PR

I'm referring to my comments not the overall PR. I'm saying this proposal has issues today, regardless of what's coming down the pike.

@LucaButBoring
Contributor

LucaButBoring commented Jun 12, 2025

I'm still not convinced that the SSE part of this is much of an issue — it follows nearly the same interaction flow as sampling, and I know for a fact that:

  1. Sampling within a tool call works in the TS/Java SDKs (see sample TS code in docs(sampling): rewrite sampling docs for MCP server builders #515 which actually uses stdio and is therefore stricter; Java SDK does this in SSE tests as well)
  2. Elicitation within a tool call works in the Java SDK (that's what my test cases did there)

An SSE stream isn't a blocking channel; it's multiplexed unless explicitly stated otherwise in the spec. That's what things like JSON-RPC message IDs help handle: mapping a response back to its request when many overlapping requests are running at once on a given stream or streams. It might be worth clarifying that in the transport specification, but I don't think blocking is the default assumption by any means.
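The ID-based correlation described here can be sketched as a pending-request map (class and handler names are hypothetical, not from any SDK): responses can then arrive in any order and interleave with server-initiated requests on the same stream:

```typescript
// Sketch of JSON-RPC id correlation on a multiplexed stream (names
// illustrative, not from any MCP SDK): pending requests are keyed by
// id, so responses can arrive in any order, interleaved with
// server-initiated requests.
class PendingRequests {
  private pending = new Map<number, (result: string) => void>();
  private nextId = 1;

  register(onResult: (result: string) => void): number {
    const id = this.nextId++;
    this.pending.set(id, onResult);
    return id;
  }

  resolve(id: number, result: string): boolean {
    const handler = this.pending.get(id);
    if (handler === undefined) return false; // unknown id, or already resolved
    this.pending.delete(id);
    handler(result);
    return true;
  }
}

const pending = new PendingRequests();
const results: string[] = [];
const a = pending.register((r) => results.push(`a: ${r}`));
const b = pending.register((r) => results.push(`b: ${r}`));

// Responses arrive out of order on the shared stream:
pending.resolve(b, "second request done");
pending.resolve(a, "first request done");
```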


On auth, I think compartmentalizing that was probably the right move for now, because MITM is an inherent issue in multi-layered server setups, and that's an issue that needs to be solved in more than just this. It's not something any single feature can just give a solution for, it's something that needs to be addressed across the protocol in general. It affects regular tool calls, sampling, and elicitation, because in multi-layer setups all of those are things that can be introspected by intermediate servers. The out-of-band communication proposal comes closer to handling that within a standalone feature, but it still relies on having partial trust of the intermediate servers as of now.

@patwhite
Contributor

I'm still not convinced that the SSE part of this is much of an issue — it follows nearly the same interaction flow as sampling, and I know for a fact that:

This is a great example - no one is arguing this is different from sampling. The issue is that both this and sampling require stateful servers (precluding easy deployment in serverless environments), which then leads to trouble scaling. It's also worth pointing out that sampling has essentially zero adoption, so that begs the question: should we really be using it as a model for new features? The number-one feature for adoption, by like 99%, is tool calls, so how you elicit during tool calls should be a well-thought-out, easy-to-implement feature.

The issue with multiple SSE connections came up because Jonathan was trying to figure out if it's possible to do this statelessly, and it kinda is, but it's quite complex and breaks the protocol.

On auth, I think compartmentalizing that was probably the right move for now, because MITM is an inherent issue in multi-layered server setups, and that's an issue that needs to be solved in more than just this. It's not something any single feature can just give a solution for, it's something that needs to be addressed across the protocol in general. It affects regular tool calls, sampling, and elicitation, because in multi-layer setups all of those are things that can be introspected by intermediate servers. The out-of-band communication proposal comes closer to handling that within a standalone feature, but it still relies on having partial trust of the intermediate servers as of now.

With the auth escalation we'll now have a second model for server-initiated messages asking for user input. So, I mean, it's fine, but it's just silly to approach them separately, and it just bloats the protocol. The auth escalation and sensitive-data acquisition issues are 100% the same thing and should be dealt with holistically.

I'll add one final thought, then I guess this is all done. I do actually think this will get adoption, but I think the primary use case will be API key elicitation for server connections. I know that's explicitly against the spec, but for someone building a server that connects to an upstream service, this will be the best user experience for gathering those keys, so everyone will do it. User experience will trump the spec every day of the week. It's just another reason I think sensitive data and auth should be thought of holistically here.

@LucaButBoring
Contributor

LucaButBoring commented Jun 12, 2025

This is a great example - no one is arguing this is different from sampling. The issue is that both this and sampling require stateful servers (precluding easy deployment in serverless environments), which then leads to trouble scaling. [...] The issue with multiple SSE connections came up because Jonathan was trying to figure out if it's possible to do this statelessly, and it kinda is, but it's quite complex and breaks the protocol.

Got it - this is fair, actually. I think we should look at #543 in more detail for this; I believe that statefulness isn't actually a fundamental limitation of this interaction, but rather a limitation of how SDKs represent it. We've discussed this exact issue with respect to sampling, and the same discussions and solutions should apply here, too.

With the auth escalation we'll now have a second model for server initiated messages asking for user input. So, I mean, it's fine, but it's just silly to approach them separately, and just bloats the protocol. The auth escalation and sensitive data acquisition issues are 100% the same thing, and should be dealt with holistically.

Pretty sure we're in agreement here - for the record, I didn't mean to bring up out-of-band as a real solution, but rather as an example of yet another proposal that's trying to address it for one specific interaction pattern but makes no attempt to address it for the rest of the protocol.

(also under no illusions about how people will re-appropriate this)
