SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments #2001

jizhuozhi · 2025-12-20T20:04:18Z

Status: Draft
Type: Informational
Created: 2025-12-21
Author(s): Zhuozhi Ji <jizhuozhi.george@gmail.com> (@jizhuozhi)
Sponsor: None
PR: SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments #2001

Abstract

This SEP proposes optional high availability (HA) best practices for MCP deployments with stateful streaming sessions (e.g., SSE). While the MCP protocol itself remains unchanged, production deployments often face challenges in maintaining session continuity and resilience when using multiple replicas behind load balancers. This proposal outlines optional patterns, including pub-sub event buses, cluster coordination with P2P forwarding, middleware/SDK abstraction, and session partitioning. These patterns provide guidance for implementers to achieve HA without breaking protocol compatibility or requiring client modifications.

Motivation

Production MCP deployments increasingly target multi-node, horizontally scalable environments. Long-lived streaming sessions (SSE) introduce challenges when routed through stateless HTTP ingress or load balancers:

Session continuity may break if connections are routed to a different replica.
Node failure or restart can interrupt ongoing streaming sessions.
Resuming sessions across replicas is non-trivial without coordination.

Community discussions, including GitHub PR #325, have highlighted these issues. Contributors concluded that session stickiness or shared session stores are practical implementation considerations, but not mandated by the protocol. This creates an opportunity for informational guidance on HA patterns that are optional and non-intrusive.

Specification

This SEP does not introduce protocol-level changes. The following optional HA patterns are proposed for implementers:

1. Core HA Patterns

1.1 Event Bus / Pub-Sub

Externalize session events to a distributed pub-sub system.
MCP replicas subscribe to session events to enable failover and session recovery.
Decouples session lifetime from any single node.
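
The pub-sub pattern above can be sketched with an in-memory stand-in for the external bus. This is a minimal illustration, not a recommended implementation: `InMemoryEventBus`, its methods, and the integer event IDs are all hypothetical names chosen for the example, and a real deployment would use an external broker with its own delivery semantics.

```python
import asyncio
from collections import defaultdict


class InMemoryEventBus:
    """Hypothetical stand-in for an external pub-sub system.

    Keeps an ordered log per session so a new subscriber can catch up.
    """

    def __init__(self):
        self._log = defaultdict(list)          # session_id -> [(event_id, event)]
        self._subscribers = defaultdict(list)  # session_id -> [asyncio.Queue]

    async def publish(self, session_id, event):
        # Event IDs are assumed to be per-session, monotonically increasing integers.
        event_id = len(self._log[session_id]) + 1
        self._log[session_id].append((event_id, event))
        for queue in self._subscribers[session_id]:
            await queue.put((event_id, event))
        return event_id

    def subscribe(self, session_id, last_event_id=0):
        """Return a queue pre-filled with events newer than last_event_id."""
        queue = asyncio.Queue()
        for event_id, event in self._log[session_id]:
            if event_id > last_event_id:
                queue.put_nowait((event_id, event))
        self._subscribers[session_id].append(queue)
        return queue


async def demo():
    bus = InMemoryEventBus()
    # Replica A publishes two events for session "s1", then fails.
    await bus.publish("s1", "progress: 50%")
    await bus.publish("s1", "progress: 100%")
    # Replica B takes over; the client had already received event 1.
    queue = bus.subscribe("s1", last_event_id=1)
    return await queue.get()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the log lives outside any replica, the replica that picks up the session only needs the last event ID the client saw in order to continue the stream.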

1.2 Cluster Coordination & P2P Forwarding

MCP nodes maintain lightweight cluster state via gossip, shared stores, or database-backed discovery (e.g., JGroups JDBC_PING).
Session messages can be forwarded to the node currently handling the session.
Avoids heavy consensus mechanisms to preserve throughput.
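
A rough sketch of the forwarding idea follows, assuming a shared `session_id -> owner` mapping that every node can read. The `Node` class, the plain dictionaries standing in for cluster state and network links, and the policy of claiming unknown sessions are assumptions made for the example only.

```python
import asyncio


class Node:
    """One MCP replica in a hypothetical cluster."""

    def __init__(self, name, registry, peers):
        self.name = name
        self.registry = registry  # stand-in for gossip state or a shared store
        self.peers = peers        # name -> Node; stand-in for network links
        self.delivered = []       # messages handled locally
        peers[name] = self

    def open_session(self, session_id):
        # Record this node as the current owner of the session.
        self.registry[session_id] = self.name

    async def route(self, session_id, message):
        owner = self.registry.get(session_id)
        if owner is None:
            # Unknown session: claim it locally (an assumed policy).
            self.open_session(session_id)
            owner = self.name
        if owner == self.name:
            self.delivered.append((session_id, message))
            return self.name
        # Forward to the node that currently handles the session.
        return await self.peers[owner].route(session_id, message)


async def demo():
    registry, peers = {}, {}
    a = Node("a", registry, peers)
    b = Node("b", registry, peers)
    a.open_session("s1")
    # The load balancer sends the request to b, which forwards it to a.
    handled_by = await b.route("s1", {"type": "tool_call"})
    return handled_by, a.delivered


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The lookup is a single read of shared state followed by at most one hop, which is the kind of lightweight coordination this pattern favors over consensus. A stale registry entry would need extra handling that the sketch omits.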

2. Implementation & Optimization Support

2.1 Middleware / SDK Abstraction

Encapsulates HA logic (pub-sub, P2P forwarding) within SDK or middleware.
Keeps protocol handlers and business logic unchanged.
Provides a transparent API to clients, allowing gradual adoption.

2.2 Session Partitioning / Affinity Hints

Session IDs may encode partitioning or affinity hints.
Reduces coordination overhead.
Affinity is advisory and must not impact correctness.
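
One way to read "advisory" affinity is shown below: the session ID carries a partition prefix, but routing still works when the prefix is missing or invalid. The ID layout (`<hex partition>-<random token>`), `NUM_PARTITIONS`, and all function names are assumptions for illustration; MCP does not define any of them.

```python
import hashlib
import secrets

NUM_PARTITIONS = 16  # assumed partition count for this sketch


def new_session_id(partition):
    """Build an ID whose prefix hints at a partition (hypothetical layout)."""
    return f"{partition:02x}-{secrets.token_hex(16)}"


def partition_hint(session_id):
    """Return the hinted partition, or None if the ID carries no valid hint."""
    prefix, sep, _ = session_id.partition("-")
    if not sep:
        return None
    try:
        value = int(prefix, 16)
    except ValueError:
        return None
    return value if 0 <= value < NUM_PARTITIONS else None


def pick_replica(session_id, replicas):
    """Prefer the hint; fall back to a stable hash so routing never fails."""
    hint = partition_hint(session_id)
    if hint is None:
        digest = hashlib.sha256(session_id.encode()).digest()
        hint = int.from_bytes(digest[:4], "big")
    return replicas[hint % len(replicas)]
```

A router might call `pick_replica(session_id, ["a", "b", "c"])` to choose a preferred replica. Since an ID without a hint still resolves to some replica, the hint affects only locality and never correctness.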

3. Illustrative Middleware-Oriented Model (Python, Non-Normative)

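# Plain protocol handler with no HA logic.
# `run_tool` stands for application code and is not defined here.
async def handle_mcp_message(message, send):
    if message["type"] == "tool_call":
        result = await run_tool(message["payload"])
        await send({
            "type": "tool_result",
            "payload": result
        })


class MCPHAMiddleware:
    """Wraps a handler so that outgoing messages pass through an HA backend."""

    def __init__(self, ha_backend):
        # ha_backend must provide ensure_session() and bind_session(); how it
        # implements them (pub-sub, P2P forwarding, ...) is deployment-specific.
        self.ha = ha_backend

    def wrap(self, handler):
        async def wrapped(message, send):
            # Resolve (or create) the session this message belongs to.
            session_id = self.ha.ensure_session(message)

            # ha_send replaces send for the duration of the call, so the
            # backend sees every outgoing message for this session.
            async with self.ha.bind_session(session_id, send) as ha_send:
                await handler(message, ha_send)

        return wrapped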
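
The `ha_backend` object used by `MCPHAMiddleware` above is left undefined. As a hedged illustration of the two methods it needs, here is a minimal in-memory backend. The class name, the `session_id` message key, and the per-session `outbox` are assumptions for the example and are not part of the SEP or of any MCP SDK.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager


class InMemoryHABackend:
    """Hypothetical backend exposing ensure_session() and bind_session()."""

    def __init__(self):
        self.outbox = {}  # session_id -> list of messages sent so far

    def ensure_session(self, message):
        # Assumes the message may carry a "session_id" key; otherwise mint one.
        session_id = message.get("session_id") or uuid.uuid4().hex
        self.outbox.setdefault(session_id, [])
        return session_id

    @asynccontextmanager
    async def bind_session(self, session_id, send):
        async def ha_send(outgoing):
            # Record before delivery so another replica could replay the
            # message if this node fails mid-stream.
            self.outbox[session_id].append(outgoing)
            await send(outgoing)

        yield ha_send


async def demo():
    backend = InMemoryHABackend()
    delivered = []

    async def send(outgoing):
        delivered.append(outgoing)

    message = {"type": "tool_call", "session_id": "s1", "payload": {}}
    session_id = backend.ensure_session(message)
    async with backend.bind_session(session_id, send) as ha_send:
        await ha_send({"type": "tool_result", "payload": "ok"})
    return backend.outbox[session_id], delivered


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A production backend would keep the outbox in an external store or publish it to a bus; the in-memory dictionary only shows where that hook sits.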

Rationale

Alternate designs considered: Sticky sessions at load balancer, full Raft replication, central shared state.
Why chosen approach: Optional patterns allow HA without protocol changes, preserve throughput, and provide flexibility.
Related work: Community PR #325 ("Add best practices when using load balancer"); common HA patterns in distributed systems.
Community consensus: PR discussion supports optional, non-normative guidance for HA.

Backward Compatibility

No protocol changes are introduced. Existing clients and servers remain fully compatible. Adoption of HA patterns is optional and implementation-defined.

Security Implications

This SEP does not change the protocol, so it adds no protocol-level attack surface. Deployments that adopt these patterns should still apply standard security practices to the distributed coordination, pub-sub, and session-forwarding channels they introduce.

Reference Implementation

Prototype Python middleware shown above.
No full reference implementation is required to mark the SEP as Draft.

Additional Optional Sections

Performance Implications

Optional HA patterns may introduce additional latency or coordination overhead, but throughput is preserved by avoiding heavy consensus.

Testing Plan

Implementers should validate session continuity during failover, replica restart, and load balancer routing.
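
A failover check along these lines could start from a small simulation like the one below. It is only a sketch under stated assumptions: `SharedLog`, `Replica`, and `read_with_failover` are hypothetical helpers that model replicas reading from a common event log, not real MCP components.

```python
class SharedLog:
    """Ordered event log shared by all replicas; event id = list index + 1."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events)

    def since(self, last_event_id):
        # Return (event_id, event) pairs newer than last_event_id.
        return list(enumerate(self.events[last_event_id:], start=last_event_id + 1))


class Replica:
    def __init__(self, log):
        self.log = log
        self.alive = True

    def stream(self, last_event_id):
        if not self.alive:
            raise ConnectionError("replica is down")
        return self.log.since(last_event_id)


def read_with_failover(replicas, last_event_id):
    """Try replicas in order, as a load balancer retry might."""
    for replica in replicas:
        try:
            return replica.stream(last_event_id)
        except ConnectionError:
            continue
    raise ConnectionError("no replica available")


def test_failover_preserves_order():
    log = SharedLog()
    a, b = Replica(log), Replica(log)
    log.append("e1")
    log.append("e2")
    seen = read_with_failover([a, b], last_event_id=0)
    a.alive = False  # simulate node failure
    log.append("e3")
    seen += read_with_failover([a, b], last_event_id=seen[-1][0])
    # The client must see every event exactly once, in order.
    assert [event_id for event_id, _ in seen] == [1, 2, 3]
    return seen
```

The same shape of test can be repeated for replica restart (flip `alive` back to `True`) and for routing changes (reorder the replica list).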

Alternatives Considered

Sticky sessions at LB (less flexible, not always feasible)
Full Raft replication (high latency, throughput penalty)
Central shared store (adds infrastructure complexity)

Open Questions

Best practices for large clusters with thousands of concurrent streaming sessions.
Integration guidance for Streamable HTTP once adoption increases.

Acknowledgments

Community contributors to PR #325 ("Add best practices when using load balancer") for highlighting HA challenges in production MCP deployments.

jizhuozhi · 2025-12-22T05:52:33Z

One additional point worth clarifying is where HA responsibility should live.

While PR #206 acknowledges stateful servers and session IDs, it implicitly leaves routing and session continuity to external components (e.g. sticky routing at a proxy or message bus–based routing). In practice, relying on load balancers or proxies to provide correctness guarantees for stateful streaming sessions introduces operational uncertainty and deployment-specific behavior.

This SEP intentionally frames HA as something that can be implemented and controlled by the MCP server itself, rather than being dependent on proxy-level stickiness or opaque middleware behavior. By doing so, MCP servers can provide predictable session affinity, failover handling, and recovery semantics that are consistent across environments, independent of ingress or proxy configuration.

In that sense, the proposal does not replace existing deployment options, but highlights that server-managed HA stickiness is both feasible and preferable for stateful streaming use cases, especially in production-grade, multi-replica deployments.

jizhuozhi changed the title ~~SEP-0000: Optional High Availability Patterns for Stateful Streaming in MCP Deployments~~ SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments Dec 20, 2025

jizhuozhi requested review from a team as code owners December 22, 2025 06:03

SEP-2001: Optional High Availability Patterns for Stateful Streaming …

2deeae0

…in MCP Deployments modelcontextprotocol#2001

jizhuozhi force-pushed the main branch from 543d7e6 to 2deeae0 Compare December 22, 2025 06:05
