@jonathanhefner
Member

@jonathanhefner jonathanhefner commented Jul 7, 2025

Preamble

Transport-agnostic Resumable Requests

Authors: Jonathan Hefner (jonathan@hefner.pro), Connor Peet (connor@peet.io)

Abstract

This proposal describes a transport-agnostic mechanism for resuming requests after disconnections. Using this mechanism:

  • Clients and servers can disconnect and reconnect without losing progress.
  • Servers can communicate expire-after-disconnect timeouts and reclaim resources thereafter.
  • Clients can check request status after disconnect without having to fetch undelivered messages.
  • All of the above works regardless of transport (HTTP, WebSocket, stdio, etc.).

Motivation

  • Addressing limitations of resumability when using the Streamable HTTP transport.
    • SSE-based resume requires the server to send at least one event in order for the client to obtain a Last-Event-ID. If a connection is lost before an event is sent, there is no way for the client to resume the SSE stream. This is especially problematic because the spec currently says that disconnection should not be interpreted as the client cancelling its request.
    • The spec does not indicate whether a server can delete previously missed SSE events once they have been confirmed delivered by a resume. The spec could explicitly allow this, but resuming is done via HTTP GET, and HTTP GET requests should be read-only.
    • There is no mechanism for a server to communicate that it will expire a request after a certain duration of client inactivity.
  • Extending resumability to other transports.
    • Because resumability is defined by the transport layer, the burden of creating new or custom transports is higher.
    • If each transport defines its own version of resumability, it is more difficult to develop MCP features without accounting for (or relying on) the nuances of a particular transport.
  • Enabling robust handling of long-running requests such as tool calls.
    • The spec does not allow servers to close a connection while computing a result. In other words, servers must maintain potentially long-running connections.
    • There is no mechanism for a client to check the status of a request after disconnection without having to fetch undelivered messages.

Specification

  1. If a client has advertised the resumableRequests capability, a server MAY send a notifications/requests/resumePolicy notification when responding to a request. The notification will specify the resume policy for the request in the event of disconnection, and will include a token that the client can use to resume the request.
  2. After the resume policy is sent, both the client and the server MAY disconnect at will. This allows servers to handle long-running requests without maintaining a constant connection.
  3. After a disconnection, a client can optionally send a requests/getStatus request to get the status of the original request without fetching pending messages. If the parameters of the requests/getStatus request are valid per the resume policy, the server SHOULD reset policy-related timers and then return the status of the original request.
  4. After a disconnection, clients can resume the request by sending a requests/resume request with the same message ID as the original request, plus the server-issued token as a parameter. If the ID and token are valid per the resume policy, the server SHOULD reset policy-related timers, send any pending messages (e.g., progress notifications), and then continue as if it were handling the original request.
sequenceDiagram
    participant Client
    participant Server

    Client->>+Server: Request (e.g., tools/call)<br>{ id: 123, params: { ... } }

    Server-->>Client: notifications/requests/resumePolicy<br>{ params: { requestId: 123, resumeToken: "abc" } }
    loop
        Server-->>Client: Messages (e.g., notifications/progress)
    end
    Server--x-Client: Disconnection occurs

    Note over Client: Client checks request status (optional)
    opt 
        Client->>+Server: requests/getStatus<br>{ params: { requestId: 123, resumeToken: "abc" } }
        Server-->>-Client: GetRequestStatusResult
    end

    Note over Client: Client decides to resume
    Client->>+Server: requests/resume<br>{ id: 123, params: { resumeToken: "abc" } }<br>[Same `id` as original request]
    Server-->>Client: Undelivered messages
    loop
        Server-->>Client: Messages (e.g., notifications/progress)
    end
    Server-->>-Client: CallToolResult<br>{ id: 123, result: { ... } }
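The flow above can be sketched from the client's side in TypeScript. This is a minimal illustration, not part of the proposal: the interface names, the helper functions, and the `statusId` argument are assumptions; only the method names and the `requestId`, `resumeToken`, and `maxWait` parameters appear in the text.

```typescript
// Hypothetical shapes. Only the method names and the `requestId`,
// `resumeToken`, and `maxWait` parameters come from the proposal text.
type MessageId = string | number;

interface ResumePolicyNotification {
  jsonrpc: "2.0";
  method: "notifications/requests/resumePolicy";
  params: { requestId: MessageId; resumeToken: string; maxWait?: number };
}

interface ResumeRequest {
  jsonrpc: "2.0";
  id: MessageId;
  method: "requests/resume";
  params: { resumeToken: string };
}

interface GetStatusRequest {
  jsonrpc: "2.0";
  id: MessageId;
  method: "requests/getStatus";
  params: { requestId: MessageId; resumeToken: string };
}

// requests/resume reuses the original request's JSON-RPC message ID.
function buildResumeRequest(policy: ResumePolicyNotification): ResumeRequest {
  return {
    jsonrpc: "2.0",
    id: policy.params.requestId,
    method: "requests/resume",
    params: { resumeToken: policy.params.resumeToken },
  };
}

// requests/getStatus refers to the original request via params. Giving it
// a fresh message ID (`statusId`) is an assumption; the diagram shows none.
function buildGetStatusRequest(
  policy: ResumePolicyNotification,
  statusId: MessageId,
): GetStatusRequest {
  return {
    jsonrpc: "2.0",
    id: statusId,
    method: "requests/getStatus",
    params: {
      requestId: policy.params.requestId,
      resumeToken: policy.params.resumeToken,
    },
  };
}
```

With the values from the diagram (`requestId: 123`, `resumeToken: "abc"`), `buildResumeRequest` yields a message whose `id` is `123`, matching the original `tools/call`.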

Rationale

The above specification addresses the issues outlined in the Motivation in the following ways:

  • The server sends a notifications/requests/resumePolicy notification as soon as possible after determining a request should be resumable. This causes the Streamable HTTP transport to send a usable Last-Event-ID to the client.
  • Because a client resumes using a request ID instead of solely an event ID, there is no expectation for servers to retain messages that have been confirmed delivered. Furthermore, for the Streamable HTTP transport, requests/resume is sent via POST, not GET, allowing servers to delete delivered messages as part of the resume request.
  • The notifications/requests/resumePolicy notification includes an optional maxWait parameter, informing the client of the maximum number of seconds it may wait after a disconnection before resuming the request or checking its status. After this time has elapsed, the server MAY cancel the request and free all associated resources.
  • Because resumability is handled at the application layer via notifications/requests/resumePolicy and requests/resume, it works the same for all transports.
  • After sending a notifications/requests/resumePolicy notification, the server is allowed to disconnect at will. Thus the server is not required to maintain a long-running connection.
  • The client can use requests/getStatus to check the status of a request after disconnection without having to fetch undelivered messages.
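One way a server might track the maxWait window is sketched below. This is an illustration under stated assumptions: the `ResumeDeadlines` class, its method names, and the injected clock are hypothetical, and the proposal does not prescribe any particular bookkeeping.

```typescript
// Hypothetical server-side bookkeeping for `maxWait` (in seconds).
class ResumeDeadlines {
  private deadlines = new Map<string, number>(); // resumeToken -> expiry (ms)
  private readonly now: () => number;

  // The clock is injectable so the behavior can be checked deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call when the connection drops, and again on every valid
  // requests/getStatus or requests/resume ("reset policy-related timers").
  touch(resumeToken: string, maxWaitSeconds: number): void {
    this.deadlines.set(resumeToken, this.now() + maxWaitSeconds * 1000);
  }

  // Call once the request has finished or the client is connected again.
  clear(resumeToken: string): void {
    this.deadlines.delete(resumeToken);
  }

  // Tokens whose window has elapsed; the server MAY cancel these requests.
  expired(): string[] {
    const cutoff = this.now();
    const result: string[] = [];
    this.deadlines.forEach((deadline, token) => {
      if (deadline <= cutoff) result.push(token);
    });
    return result;
  }
}
```

A periodic sweep could call `expired()` and cancel whatever it returns. Because `touch` overwrites the stored deadline, each status check or resume extends the window by a full `maxWait`.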

Future Work

  • Support a callback mechanism such as webhooks.
    • A client could inform the server about a webhook via either a client capability or a _meta parameter for the request. Upon completion of the request, if the client is disconnected, the server could send the request ID to the webhook. The webhook host could then send a notification (e.g. push notification) to the client, and the client could resume the request to receive the result.
  • Use resumable requests for subscriptions.
  • Support client roaming.
    • Perhaps in the form of methods like requests/resume/all and requests/getStatus/all, or maybe something more closely integrated with sessions (e.g. a sessions/resume method).

Alternatives

  • #899: Transport-agnostic resumable streams

    This proposal is a simplified version of #899. This proposal focuses on making JSON-RPC requests resumable in a transport-agnostic way, whereas #899 proposes a more general transport-agnostic mechanism (streams).

    In terms of functionality, the two are mostly equivalent, but for this proposal, resumability is bounded by the JSON-RPC request message and response message. Thus, with this proposal, resumability cannot begin with a JSON-RPC notification, nor can it extend beyond a JSON-RPC response (whereas both of those things are possible with #899).

  • Resource-based approaches

    Resource-based approaches propose assigning a resource URL to a tool call result so that the client may read it at a later time. This requires modifying the definition of resources to accommodate the CallToolResult type, which does not have a 1-to-1 mapping with the TextResourceContents / BlobResourceContents types. It also requires modifying the definition of resources such that resources may be "not ready", which in turn impacts all existing clients and servers that use resources.

    More critically, though, resource-based approaches require distinct handling mechanisms for each message type other than CallToolResult. Fundamentally, the output of a request, such as a tool call, is a sequence of messages, even if the cardinality is 1 in many cases. If we try to represent the output as a resource, then we must define ways to handle messages that do not fit in a resource, such as progress notifications and sampling requests. Each message type that we introduce would need consideration about how it would work with "resource-ended" requests versus "normal" requests.

    A resource-based approach would increase the number of provisions the spec must make, increase the number of code paths required for implementation, and increase the potential for incompatibilities when extending the spec.

  • #650: tools/async/call vs tools/call

    #650 proposes adding a new type of tool call, tools/async/call. When a client calls a tool via tools/async/call, the server returns a CallToolAsyncResult response which includes a token. The client can then use the token to check the status of the tool call via tools/async/status, and to fetch the tool call result via tools/async/result.

    There is some overlap between #650 and this proposal, such as using tokens and having a dedicated polling method, but there are some important differences:

    • With #650, the client drives the decision of whether the tool call is async. This means the server cannot make the decision based on input arguments or session state.

    • #650 requires the server to implement an additional form of persistence for tool call results, separate from the message queue it must already implement for resumability.

    • Because tools/async/result only captures the tool call result, #650 effectively requires the client to stream from the GET /mcp endpoint. Otherwise, the client may miss server-sent requests (e.g. sampling requests) that would block tool call progress.

      Thus, #650 is still affected by the same problems listed in this proposal's "Motivation" section. For example, if a disconnection occurs before the client receives an event ID on the GET /mcp endpoint, and the server sends a sampling request, then the tool call would be blocked until it expires because the client would have no way to get the sampling request.

      Furthermore, it raises the question: if the client must stream from that (or any other) endpoint, why not also send the tool call result on that stream? (If the answer is to make the result fetchable separately from the stream, that can be achieved with resource links instead.)

  • #1003: Resume tokens for long-running operations

    Essentially, #1003 is cursor-based pagination of results. In order to benefit from the proposal, a method must divide its result into chunks. Calls to retrieve each chunk are affected by the same problems listed in this proposal's "Motivation" section. If a result is divided finely enough, the problems could be mitigated; however, each chunk requires an additional round trip. Also, #1003 does not apply when a result is indivisible, such as for a long-running computation that produces a single value.

    Other differences:

    • #1003 assumes client support; it does not define additional client capabilities nor consider them. If a client does not support the proposal, it will only receive the first chunk of the result. If the proposal were to define an additional client capability, it is not clear how result chunks could be automatically combined to support clients without the capability.

      Note: if we decide we want to assume client support, this proposal (#925) can drop the resumableRequests client capability. Everything else will work as expected.

    • With #1003, the only way for a client to check the status of a request is to resume the request. If the server does not return an error, then the request is still ongoing.

      Note: if we decide we don't want to support a dedicated polling mechanism, this proposal (#925) can drop the requests/getStatus method. Everything else will work as expected.

Backwards Compatibility

This feature is backward compatible because clients must opt in by advertising the resumableRequests capability, and servers have no obligation to send a notifications/requests/resumePolicy notification.

Security Implications

The resumeToken that the server issues as part of the notifications/requests/resumePolicy notification should be treated as sensitive information because it can be used to access messages related to the request.
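Because the token acts as a bearer credential, a server would likely want it to be unguessable and compared in constant time. The sketch below is one possible approach using Node.js's built-in `crypto` module. The function names and the 32-byte token size are assumptions; the proposal does not specify a token format.

```typescript
import { randomBytes, timingSafeEqual } from "crypto";

// Hypothetical helper: issue an unguessable token.
// 32 random bytes (64 hex characters) is an assumed size, not a requirement.
function issueResumeToken(): string {
  return randomBytes(32).toString("hex");
}

// Compare tokens in constant time so that response timing does not reveal
// how much of a guessed token was correct. timingSafeEqual throws when the
// lengths differ, so the length check comes first.
function tokensMatch(expected: string, presented: string): boolean {
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(presented, "utf8");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

A server might call `tokensMatch` when validating `requests/resume` and `requests/getStatus`, and avoid writing the token to logs.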

@dsp-ant
Member

dsp-ant commented Jul 15, 2025

@connor4312 @jonathanhefner This should be a SEP and should have an associated issue. Since you are maintainers, you are free to set yourself as sponsors (by assigning one of you the issue). Once you have an associated issue, I'll give you a SEP number.

@jonathanhefner jonathanhefner linked an issue Jul 15, 2025 that may be closed by this pull request
@dsp-ant dsp-ant changed the title Support transport-agnostic resumable requests SEP-003: Support transport-agnostic resumable requests Jul 15, 2025

After the resume policy is sent, both the client and the server **MAY** disconnect at will. This allows servers to handle long-running requests without maintaining a constant connection.

After a disconnection, clients can resume the request by sending a [`requests/resume`][] request with the **same ID** as the original request, plus the server-issued token as a parameter. If the ID and token are valid per the resume policy, the server **SHOULD** reset policy-related timers, send any pending messages (e.g., progress notifications), and then continue as if it were handling the original request.
Contributor

@connor4312 connor4312 Jul 15, 2025

This is a kind of creative flow. I wonder if we could just have this as a notification that contains a result: ServerResult rather than resuming on the exact same ID. That seems like it might involve less special-casing for clients (e.g. no need to 'reserve' event IDs for requests that might get resumed later)

Member Author

I'm not sure I follow. Are you saying have requests/resume as a notification? Where does result: ServerResult fit in?

Contributor

I mean having the response to the resumed request come in the form of a notification. requests/resume would still respond with a success or error, and then the result would later come via a notification like { method: 'notifications/requests/resumedCompleted', params: { resumeToken: 'foo', response: { /* ServerResult */ } } }, rather than being a 'normal' reply reusing the old event ID.

Member Author

Oh, I see. In my opinion, that goes in the wrong direction. I would like to make resumable requests be as congruent to normal requests as possible. That way servers can emit messages without being concerned about the request "mode".

For example, a tool may emit a normal response message, but then the delivery fails. When the client resumes the request, the server should be able to just replay the message without rewriting it. Or, for example, servers in a distributed architecture can emit response messages without knowledge of whether the front-end server has disconnected.
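The replay behavior described here could look roughly like the following. This is a sketch under assumptions: the `ReplayQueue` class and the convention that a `send` callback throws on delivery failure are hypothetical, not something the proposal defines.

```typescript
// Loosely typed JSON-RPC message, enough for this sketch.
type JsonRpcMessage = {
  jsonrpc: "2.0";
  id?: string | number;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: unknown;
};

// Hypothetical queue: producers emit messages exactly as they would for a
// normal request, without knowing whether a client is currently connected.
class ReplayQueue {
  private pending: JsonRpcMessage[] = [];

  // Queue the message, then try to deliver everything in order.
  emit(message: JsonRpcMessage, send: (m: JsonRpcMessage) => void): void {
    this.pending.push(message);
    this.flush(send);
  }

  // Also called on requests/resume. Messages are sent unchanged, and each
  // one is dropped only after `send` returns without throwing.
  flush(send: (m: JsonRpcMessage) => void): void {
    while (this.pending.length > 0) {
      try {
        send(this.pending[0]);
      } catch {
        return; // delivery failed; keep the rest for the next resume
      }
      this.pending.shift();
    }
  }
}
```

In this sketch a response that fails to deliver stays in the queue as-is, and a later `flush` hands the same object to the new connection, so the producer never has to rewrite it for a resumed request.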

Contributor

@Joffref Joffref left a comment

Overall, this looks good — I prefer this proposal over the previous one.
Small side question: how should the server react if a request is resumed twice? Do we drop everything on the first resume, or do we keep it?
I know it’s a bit of a silly question, since it would imply two clients trying to get the same response for a tool call that was only initiated once — but it’s still interesting to define the boundary just in case.

* - `"failed"` indicates that the server has a final response, but the
* response is an error.
*/
status: "processing" | "completed" | "failed";
Contributor

Don’t we want a 'pending' status as well? For example, if my request depends on another event, like a pending validation?

I know it's already covered by hasPendingMessage and hasInputRequest, but I’m sure there will be cases where this is done out-of-band.

Member Author

I think if we add "pending" there may be an expectation that the server accurately reports whether it is "processing" or "pending". (Hypothetically, a server could be "processing" even when hasInputRequest is true.)

The problem with that expectation is that the state of the server is not observable from just the JSON-RPC messages it emits. For example, if you have a back-end server that is emitting messages, and a front-end server that is answering requests/getStatus, the front-end server wouldn't know whether the back-end server is "processing" or "pending" without some other communication channel (besides the JSON-RPC message queue).

It's doable, but I'm not sure if we want to bake that kind of expectation into the protocol.
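To make the point concrete, here is a rough sketch of a front-end server deriving a status from nothing but the queued JSON-RPC messages. The `status` values and the `hasPendingMessage` / `hasInputRequest` names come from this thread; the `StatusSketch` and `QueuedMessage` types, the `statusFromQueue` function, and the interpretation of each field are assumptions.

```typescript
// Hypothetical queue entry: one of the four JSON-RPC message shapes.
type QueuedMessage =
  | { jsonrpc: "2.0"; method: string; params?: unknown } // notification
  | { jsonrpc: "2.0"; id: string | number; method: string; params?: unknown } // server-to-client request
  | { jsonrpc: "2.0"; id: string | number; result: unknown } // final result
  | { jsonrpc: "2.0"; id: string | number; error: { code: number; message: string } }; // final error

interface StatusSketch {
  status: "processing" | "completed" | "failed";
  hasPendingMessage: boolean;
  hasInputRequest: boolean;
}

// Assumed reading: a queued error response means "failed", a queued result
// means "completed", and a queued server-to-client request (e.g., sampling)
// counts as an input request. Anything else is reported as "processing".
function statusFromQueue(queue: QueuedMessage[]): StatusSketch {
  let status: StatusSketch["status"] = "processing";
  let hasInputRequest = false;
  for (const message of queue) {
    if ("error" in message) status = "failed";
    else if ("result" in message) status = "completed";
    else if ("id" in message) hasInputRequest = true;
  }
  return { status, hasPendingMessage: queue.length > 0, hasInputRequest };
}
```

All three fields fall out of the queue contents. A "pending" status would not, because whether the back end is waiting on an out-of-band event leaves no trace in the queue.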


After the resume policy is sent, both the client and the server **MAY** disconnect at will. This allows servers to handle long-running requests without maintaining a constant connection.

After a disconnection, clients can resume the request by sending a [`requests/resume`][] request with the **same ID** as the original request, plus the server-issued token as a parameter. If the ID and token are valid per the resume policy, the server **SHOULD** reset policy-related timers, send any pending messages (e.g., progress notifications), and then continue as if it were handling the original request.
Contributor

Why do we need to maintain two IDs here? I believe only the requestId is needed.

Member Author

We talked about this during the meeting earlier, but for posterity: Instead of a requestId param, the requests/resume request has the same JSON-RPC message ID as the original request, i.e. the same value for the id property.

I've changed the language to "same message ID as the original request", and I've changed the notation in the diagram to better distinguish between message IDs and params.

@ihrpr
Contributor

ihrpr commented Jul 17, 2025

@jonathanhefner is there a prototype PR in any of the SDKs? (ideally Python or TypeScript)

Clients and servers can disconnect and reconnect without losing progress.

How will this work for stdio?

This adds a transport-agnostic mechanism for resuming requests after
disconnections.  Using this mechanism:

- Clients and servers can disconnect and reconnect without losing
  progress.
- Servers can communicate expire-after-disconnect timeouts and reclaim
  resources thereafter.
- Clients can check request status after disconnect without having to
  fetch undelivered messages.
- All of the above works regardless of transport (HTTP, WebSocket,
  stdio, etc.).

__Motivation__

- **Addressing limitations of resumability when using the Streamable
    HTTP transport.**
  - SSE-based resume requires the server to send at least one event in
    order for the client to obtain a `Last-Event-ID`.  If a connection
    is lost before an event is sent, there is no way for the client to
    resume the SSE stream.  This is especially problematic because the
    spec currently says that disconnection should not be interpreted as
    the client cancelling its request.
  - The spec does not indicate whether a server can delete previously
    missed SSE events once they have been confirmed delivered by a
    resume.  The spec could explicitly allow this, but resuming is done
    via HTTP GET, and HTTP GET requests should be read-only.
  - There is no mechanism for a server to communicate that it will
    expire a request after a certain duration of client inactivity.
- **Extending resumability to other transports.**
  - Because resumability is defined by the transport layer, the burden
    of creating new or custom transports is higher.
  - If each transport defines its own version of resumability, it is
    more difficult to develop MCP features without accounting for (or
    relying on) the nuances of a particular transport.
- **Enabling robust handling of long-running requests such as tool
    calls.**
  - The spec does not allow servers to close a connection while
    computing a result.  In other words, servers must maintain
    potentially long-running connections.
  - There is no mechanism for a client to check the status of a request
    after disconnection without having to fetch undelivered messages.

This commit addresses the above issues in the following ways:

- The server sends a `notifications/requests/resumePolicy` notification
  as soon as possible after determining a request should be resumable.
  This causes the Streamable HTTP transport to send a usable
  `Last-Event-ID` to the client.
- Because a client resumes using a request ID instead of solely an event
  ID, there is no expectation for servers to retain messages that have
  been confirmed delivered.  Furthermore, for the Streamable HTTP
  transport, `requests/resume` is sent via POST, not GET, allowing
  servers to delete delivered messages as part of the resume request.
- The `notifications/requests/resumePolicy` notification includes an
  optional `maxWait` parameter, informing the client of the maximum
  number of seconds it may wait after a disconnection before resuming
  the request or checking its status. After this time has elapsed, the
  server MAY cancel the request and free all associated resources.
- Because resumability is handled at the application layer via
  `notifications/requests/resumePolicy` and `requests/resume`, it works
  the same for all transports.
- After sending a `notifications/requests/resumePolicy` notification,
  the server is allowed to disconnect at will.  Thus the server is not
  required to maintain a long-running connection.
- The client can use `requests/getStatus` to check the status of a
  request after disconnection without having to fetch undelivered
  messages.

Co-authored-by: Connor Peet <connor@peet.io>
@jonathanhefner jonathanhefner requested a review from a team July 17, 2025 21:34
@jonathanhefner
Member Author

is there a prototype PR in any of the SDKs? (ideally Python or Typescript)

There is currently no prototype. @connor4312 (or @almaleksia) had a prototype for #899, but this SEP evolved out of that one.

Clients and servers can disconnect and reconnect without losing progress.

How will this work for stdio?

I think that depends on whether the server is actually disconnected or not. For example, if a server is in a container, and the container is suspended, then there isn't really a disconnection. Or, likewise, if the server is a background process that is writing to a named pipe, and the client intermittently stops reading from the pipe, then there isn't really a disconnection. If there is no disconnection, then there is no need to resume a request (via requests/resume).

However, in the case where a stdio server is actually shut down, then I would expect (1) the server to resume processing on boot, and (2) the client to eventually resume the request and receive any queued messages. (By the way, I don't think this flow is specific to stdio. HTTP-based servers may also be subject to reboots. For example, when deploying a new version of the server.)

@dsp-ant dsp-ant changed the title SEP-003: Support transport-agnostic resumable requests SEP-975: Transport-agnostic resumable requests Jul 24, 2025
@jonathanhefner jonathanhefner removed the draft label Jul 25, 2025
@ihrpr
Contributor

ihrpr commented Aug 1, 2025

There was an excellent discussion about this in the working group meeting (see notes here).
The primary concern raised was that this proposal introduces a new resumability concept instead of fixing the existing transport-level mechanisms to work uniformly across all transports. As noted in #984, there's already significant confusion about session handling in MCP. Adding another layer of session-like functionality through resume tokens would compound this confusion rather than resolve it.

I'd like to better understand which use cases wouldn't be addressed if we instead:

  • Lift mcp-session-id to the protocol layer - Making session management inherently transport-agnostic rather than patching transport limitations with new abstractions.
  • Fix the edge case where resumability fails without an initial last-event-id - This directly addresses your valid SSE concern where at least one event must be sent before obtaining a resumable ID.
  • Provide request correlation for all server messages, not just resumable ones, giving us better observability and debugging capabilities as a bonus. (related_request_id is already used at the transport level in some of our SDK implementations.)

Could you help identify specific scenarios where this approach might fall short? Understanding these edge cases would help us determine whether we truly need a new abstraction or if we can achieve the same goals by fixing what already exists.

@jonathanhefner
Member Author

The primary concern raised was that this proposal introduces a new resumability concept instead of fixing the existing transport-level mechanisms to work uniformly across all transports. As noted in #984, there's already significant confusion about session handling in MCP. Adding another layer of session-like functionality through resume tokens would compound this confusion rather than resolve it.

I'd like to better understand which use cases wouldn't be addressed if we instead:

  • Lift mcp-session-id to the protocol layer - Making session management inherently transport-agnostic rather than patching transport limitations with new abstractions.
  • Fix the edge case where resumability fails without an initial last-event-id - This directly addresses your valid SSE concern where at least one event must be sent before obtaining a resumable ID.
  • Provide request correlation for all server messages, not just resumable ones, giving us better observability and debugging capabilities as a bonus. (related_request_id is already used at the transport level in some of our SDK implementations.)

Could you help identify specific scenarios where this approach might fall short?

Sure! With regard to sessions, I agree we should lift session ID into the protocol layer (while also keeping the HTTP header for routing purposes). Around that topic, the Transports WG is working on some specific proposals that clarify the usage of sessions, enable sessions regardless of transport, and better support sessionless / stateless servers. (There has been a strong push from both Google and Microsoft to better support stateless servers.) We can go into details in #984.

With regard to improving the Streamable HTTP transport, we would need to:

  • Mandate that the server immediately send an empty event to provide the client with an event ID (for Last-Event-ID).
  • Mandate that the server send the retry field in the SSE stream to prevent immediate / excessive client reconnects.
  • Change this part of the spec to allow the server to disconnect, in order to avoid long-running connections. (In this proposal, I changed the requirement such that the server may disconnect after sending notifications/requests/resumePolicy.)
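For the first two items, a priming event could be as small as the sketch below. The `SseEvent` type, the `formatSseEvent` function, and the sample ID are hypothetical. The wire format follows the server-sent events `id`, `retry`, and `data` fields, and the sketch assumes that an event block carrying only an `id` still updates the client's last event ID even though no message event is dispatched for it.

```typescript
// Hypothetical event description; all fields are optional.
interface SseEvent {
  id?: string; // must not contain newlines
  retry?: number; // reconnection delay in milliseconds
  data?: string;
}

// Serialize one event block in text/event-stream format.
function formatSseEvent(event: SseEvent): string {
  const lines: string[] = [];
  if (event.retry !== undefined) lines.push(`retry: ${event.retry}`);
  if (event.id !== undefined) lines.push(`id: ${event.id}`);
  if (event.data !== undefined) {
    // Multi-line data is sent as one `data:` field per line.
    for (const line of event.data.split("\n")) lines.push(`data: ${line}`);
  }
  // A blank line terminates the event block.
  return lines.join("\n") + "\n\n";
}

// Priming event: gives the client an ID to resume from and a retry delay
// before any real message has been produced.
const primingEvent = formatSseEvent({ id: "evt-0", retry: 5000 });
```

A server following this approach would write `primingEvent` as the first bytes of every SSE response.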

After those changes, the remaining shortcomings I can think of are:

  • The server doesn't have a way to communicate the TTL-after-disconnect for a request. (This TTL should be separate from the session TTL.)
    • One possibility would be to declare the TTL-after-disconnect as a server capability, however that would make the TTL static and global. Another possibility would be to declare TTL-after-disconnect as per-tool annotations, however that would only apply to tools instead of all requests. A third possibility would be to support both, with tool annotations taking precedence over the server capability.
  • Because resumes are done via HTTP GET, and HTTP GET should be read-only, the server should retain all messages until the TTL expires, even when it knows (via Last-Event-ID) the client has received them.
  • Because there is no resumeToken, resumability for stateless servers is necessarily tied to the SSE event ID (because it is the only available server-controlled identifier). This means the server's persistence logic will be coupled with the Streamable HTTP transport, which will make it more difficult to support other / multiple / swappable transports. (There is strong demand for alternative transports such as gRPC. I am not saying MCP should officially adopt those transports, but I do think we should make it easier to implement them.)
    • We could try to solve this at the SDK level, but that presumes all transports will have an out-of-band way to communicate server-controlled identifiers.
  • Likewise, alternative transports will have to implement their own logic for minimum retry time and TTL-after-disconnect (unless we implement these as server capabilities / tool annotations).
  • There is no way to poll the status of a request.
    • We could add a separate poll method, however, without something like resumeToken, the poll method would be usable only with sessionful servers, because there would be no server-controlled identifier to use for access control.
  • In SEP-992: Notification Configuration for Tool Call Result (#992, comment), I outlined a mechanism to support callbacks such as webhooks, for which there has also been strong demand. The mechanism piggybacks on the resumableRequests client capability and the notifications/requests/resumePolicy notification from this proposal.
    • Assuming we want to support callbacks, we could implement something similar without this proposal, however, it would still require an additional client capability and an additional server-sent notification.

@jonathanhefner
Member Author

This SEP was declined at the core maintainers meeting, but we will try to address some of the relevant concerns in future proposals.

@jonathanhefner jonathanhefner removed the in-review label Aug 6, 2025
